Preface


This is an introductory textbook of linear programming, written mainly for 
students of computer science and mathematics. Our guiding phrase is, “what 
every theoretical computer scientist should know about linear programming.” 

The book is relatively concise, in order to allow the reader to focus on 
the basic ideas. For a number of topics commonly appearing in thicker books 
on the subject, we were seriously tempted to add them to the main text, but 
we decided to present them only very briefly in a separate glossary. At the 
same time, we aim at covering the main results with complete proofs and in 
sufficient detail, in a way ready for presentation in class. 

One of the main focuses is applications of linear programming, both in 
practice and in theory. Linear programming has become an extremely flex- 
ible tool in theoretical computer science and in mathematics. While many 
of the finest modern applications are much too complicated to be included 
in an introductory text, we hope to communicate some of the flavor (and 
excitement) of such applications on simpler examples. 

We present three main computational methods. The simplex algorithm is 
first introduced on examples, and then we cover the general theory, putting 
less emphasis on implementation details. For the ellipsoid method we give 
the algorithm and the main claims required for its analysis, omitting some 
technical details. From the vast family of interior point methods, we concentrate
on one of the most efficient versions known, the primal-dual central
path method, and again we do not present the technical machinery in full.
Rigorous mathematical statements are clearly distinguished from informal 
explanations in such parts. 

The only real prerequisite to this book is undergraduate linear algebra. 
We summarize the required notions and results in an appendix. Some of the 
examples also use rudimentary graph-theoretic terminology, and at several 
places we refer to notions and facts from calculus; all of these should be a 
part of standard undergraduate curricula. 


Errors. If you find errors in the book, especially serious ones, we would 
appreciate it if you would let us know (email: matousek@kam.mff.cuni.cz, 
gaertner@inf.ethz.ch). We plan to post a list of errors at
http://www.inf.ethz.ch/personal/gaertner/lpbook.


1. What Is It, and What For?


Linear programming, surprisingly, is not directly related to computer pro- 
gramming. The term was introduced in the 1950s when computers were few 
and mostly top secret, and the word programming was a military term that, 
at that time, referred to plans or schedules for training, logistical supply, 
or deployment of men. The word linear suggests that feasible plans are restricted
by linear constraints (inequalities), and that the quality of the
plan (e.g., costs or duration) is also measured by a linear function of the
considered quantities. In a similar spirit, linear programming soon started
to be used for planning all kinds of economic activities, such as transport 
of raw materials and products among factories, sowing various crop plants, 
or cutting paper rolls into shorter ones in sizes ordered by customers. The 
phrase “planning with linear constraints” would perhaps better capture this 
original meaning of linear programming. However, the term linear programming
has been well established for many years, and at the same time, it has
acquired a considerably broader meaning: Not only does it play a role
in mathematical economics, it also appears frequently in computer science and in
many other fields.


1.1 A Linear Program 


We begin with a very simple linear programming problem (or linear pro- 
gram for short): 


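\[
\begin{array}{ll}
\text{Maximize the value} & x_1 + x_2 \\
\text{among all vectors} & (x_1, x_2) \in \mathbb{R}^2 \\
\text{satisfying the constraints} & x_1 \ge 0 \\
 & x_2 \ge 0 \\
 & x_2 - x_1 \le 1 \\
 & x_1 + 6x_2 \le 15 \\
 & 4x_1 - x_2 \le 10.
\end{array}
\]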


For this linear program we can easily draw a picture. The set
$\{x \in \mathbb{R}^2 : x_2 - x_1 \le 1\}$ is the half-plane lying below the line $x_2 = x_1 + 1$, and similarly,
each of the remaining four inequalities defines a half-plane. The set of all
vectors satisfying the five constraints simultaneously is a convex polygon:


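[Figure: the convex polygon of feasible points in the $(x_1, x_2)$-plane, with its sides labeled by the constraints, among them $x_1 + 6x_2 \le 15$ and $4x_1 - x_2 \le 10$, and an arrow in the direction (1,1).]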


Which point of this polygon maximizes the value of $x_1 + x_2$? The one lying
“farthest in the direction” of the vector (1,1) drawn by the arrow; that is, 
the point (3,2). The phrase “farthest in the direction” is in quotation marks 
since it is not quite precise. To make it more precise, we consider a line 
perpendicular to the arrow, and we think of translating it in the direction of 
the arrow. Then we are seeking a point where the moving line intersects our 
polygon for the last time. (Let us note that the function $x_1 + x_2$ is constant
on each line perpendicular to the vector (1,1), and as we move the line in 
the direction of that vector, the value of the function increases.) See the next 
illustration: 


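[Figure: the polygon together with the parallel lines $x_1 + x_2 = 2$, $x_1 + x_2 = 4$, and $x_1 + x_2 = 5$, each perpendicular to the arrow (1,1).]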
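The graphical argument can be cross-checked numerically. The sketch below is our own illustration, not a method from the text: it assumes (a fact taken on faith at this point) that a linear function on a bounded polygon attains its maximum at a corner, so it simply intersects every pair of boundary lines, keeps the feasible intersection points, and compares the objective values. The names `CONSTRAINTS`, `intersection`, and `polygon_vertices` are hypothetical helpers chosen for the example.

```python
from fractions import Fraction
from itertools import combinations

# Constraints of the example, each written as a1*x1 + a2*x2 <= b.
CONSTRAINTS = [
    (-1, 0, 0),    # x1 >= 0
    (0, -1, 0),    # x2 >= 0
    (-1, 1, 1),    # x2 - x1 <= 1
    (1, 6, 15),    # x1 + 6*x2 <= 15
    (4, -1, 10),   # 4*x1 - x2 <= 10
]


def intersection(first, second):
    """Intersection of the two boundary lines, or None if they are parallel."""
    (a1, a2, b1), (a3, a4, b2) = first, second
    det = a1 * a4 - a2 * a3
    if det == 0:
        return None
    # Cramer's rule with exact rational arithmetic.
    return (Fraction(b1 * a4 - a2 * b2, det), Fraction(a1 * b2 - b1 * a3, det))


def is_feasible(point):
    x1, x2 = point
    return all(a1 * x1 + a2 * x2 <= b for a1, a2, b in CONSTRAINTS)


def polygon_vertices():
    """All feasible intersection points of pairs of boundary lines."""
    points = (intersection(p, q) for p, q in combinations(CONSTRAINTS, 2))
    return sorted({p for p in points if p is not None and is_feasible(p)})


vertices = polygon_vertices()
best = max(vertices, key=lambda p: p[0] + p[1])
print("vertices:", [(str(x1), str(x2)) for x1, x2 in vertices])
print("best vertex:", (str(best[0]), str(best[1])), "value:", best[0] + best[1])
```

Brute-force enumeration like this is only sensible for tiny examples; the number of candidate points grows rapidly with the number of variables and constraints.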
In a general linear program we want to find a vector $x^* \in \mathbb{R}^n$ maximizing
(or minimizing) the value of a given linear function among all vectors $x \in \mathbb{R}^n$
that satisfy a given system of linear equations and inequalities. The linear
function to be maximized, or sometimes minimized, is called the objective
function. It has the form $c^T x = c_1 x_1 + \cdots + c_n x_n$, where $c \in \mathbb{R}^n$ is a given
vector.¹

The linear equations and inequalities in the linear program are called the 
constraints. It is customary to denote the number of constraints by m. 

A linear program is often written using matrices and vectors, in a way 
similar to the notation Ax = b for a system of linear equations in linear 
algebra. To make such a notation simpler, we can replace each equation in 
the linear program by two opposite inequalities. For example, instead of the 
constraint $x_1 + 3x_2 = 7$ we can put the two constraints $x_1 + 3x_2 \le 7$ and
$x_1 + 3x_2 \ge 7$. Moreover, the direction of the inequalities can be reversed
by changing the signs: $x_1 + 3x_2 \ge 7$ is equivalent to $-x_1 - 3x_2 \le -7$, and
thus we can assume that all inequality signs are “$\le$”, say, with all variables
appearing on the left-hand side. Finally, minimizing an objective function
$c^T x$ is equivalent to maximizing $-c^T x$, and hence we can always pass to a
maximization problem. After such modifications each linear program can be
expressed as follows: 


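\[
\begin{array}{ll}
\text{Maximize the value of} & c^T x \\
\text{among all vectors } x \in \mathbb{R}^n \text{ satisfying} & Ax \le b,
\end{array}
\]


where $A$ is a given $m \times n$ real matrix and $c \in \mathbb{R}^n$, $b \in \mathbb{R}^m$ are given vectors.
Here the relation $\le$ holds for two vectors of equal length if and only if it
holds componentwise.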
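The rewriting steps just described (splitting equations, flipping "at least" inequalities, negating a minimized objective) are mechanical, and a minimal Python sketch may make them concrete. The function name `to_max_leq_form` and the input format, a list of (coefficients, relation, right-hand side) triples, are our own assumptions for illustration and not notation from the text.

```python
from fractions import Fraction


def to_max_leq_form(sense, c, constraints):
    """Rewrite a linear program as: maximize c^T x subject to A x <= b.

    sense       -- "max" or "min"
    c           -- list of objective coefficients
    constraints -- list of (coefficients, relation, rhs) triples,
                   where relation is "<=", ">=" or "=="
    Returns (c, A, b) describing the equivalent maximization problem.
    """
    if sense not in ("max", "min"):
        raise ValueError(f"unknown sense: {sense!r}")
    # Minimizing c^T x is the same as maximizing (-c)^T x.
    sign = 1 if sense == "max" else -1
    c = [sign * Fraction(v) for v in c]

    A, b = [], []
    for coeffs, relation, rhs in constraints:
        if relation not in ("<=", ">=", "=="):
            raise ValueError(f"unknown relation: {relation!r}")
        row = [Fraction(v) for v in coeffs]
        rhs = Fraction(rhs)
        if relation in ("<=", "=="):
            # Keep the inequality as it is (one half of an equation).
            A.append(row)
            b.append(rhs)
        if relation in (">=", "=="):
            # Reverse the direction by changing all signs.
            A.append([-v for v in row])
            b.append(-rhs)
    return c, A, b


# Hypothetical input: minimize 2*x1 + x2 subject to x1 + 3*x2 = 7, x1 >= 0.
c, A, b = to_max_leq_form("min", [2, 1], [([1, 3], "==", 7), ([1, 0], ">=", 0)])
print(c, A, b)
```

Note that after negating a minimized objective, the optimal value of the resulting maximization problem is the negative of the original minimum, while the optimal solutions are the same.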

Any vector $x \in \mathbb{R}^n$ satisfying all constraints of a given linear program is
a feasible solution. Each $x^* \in \mathbb{R}^n$ that gives the maximum possible value
of $c^T x$ among all feasible $x$ is called an optimal solution, or optimum for
short. In our linear program above we have n = 2, m = 5, and c = (1,1).
The only optimal solution is the vector (3,2), while, for instance, (2, 3/2) is a
feasible solution that is not optimal.

A linear program may in general have a single optimal solution, or in- 
finitely many optimal solutions, or none at all. 

We have seen a situation with a single optimal solution in the first example 
of a linear program. We will present examples of the other possible situations. 


¹ Here we regard the vector c as an $n \times 1$ matrix, and so the expression $c^T x$ is a
product of a $1 \times n$ matrix and an $n \times 1$ matrix. This product, formally speaking,
should be a $1 \times 1$ matrix, but we regard it as a real number.

Some readers might wonder: If we consider c a column vector, why, in the
example above, don’t we write it as a column or as $(1, 1)^T$? For us, a vector
is an n-tuple of numbers, and when writing an explicit vector, we separate the
numbers by commas, as in c = (1,1). Only if a vector appears in a context where
one expects a matrix, that is, in a product of matrices, then it is regarded as (or
“converted to”) an $n \times 1$ matrix. (However, sometimes we declare a vector to be
a row vector, and then it behaves as a $1 \times n$ matrix.)
If we change the vector c in the example to (1/6, 1), all points on the side
of the polygon drawn thick in the next picture are optimal solutions:


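[Figure: the polygon with its sides labeled $x_2 - x_1 \le 1$, $x_1 + 6x_2 \le 15$, $x_1 \ge 0$, and $4x_1 - x_2 \le 10$; one side is drawn thick.]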


If we reverse the directions of the inequalities in the constraints $x_2 - x_1 \le 1$
and $4x_1 - x_2 \le 10$ in our first example, we obtain a linear program that has
no feasible solution, and hence no optimal solution either: 


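[Figure: the constraints after the reversal, including $x_1 + 6x_2 \le 15$ and $4x_1 - x_2 \ge 10$; no point satisfies all of them.]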


Such a linear program is called infeasible. 
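A short calculation, added here as a check and not taken from the text, confirms that the reversed system has no solution. The reversed constraints read $x_2 - x_1 \ge 1$ and $4x_1 - x_2 \ge 10$. Adding them bounds $x_1$ from below, and then $x_2$:

\[
3x_1 \ge 11 \;\Rightarrow\; x_1 \ge \tfrac{11}{3},
\qquad
x_2 \ge x_1 + 1 \ge \tfrac{14}{3},
\qquad
x_1 + 6x_2 \ge \tfrac{11}{3} + 28 = \tfrac{95}{3} > 15.
\]

This contradicts the remaining constraint $x_1 + 6x_2 \le 15$, so no feasible solution exists.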

Finally, an optimal solution need not exist even when there are feasible 
solutions. This happens when the objective function can attain arbitrarily 
large values (such a linear program is called unbounded). This is the case 
when we remove the constraints $4x_1 - x_2 \le 10$ and $x_1 + 6x_2 \le 15$ from the
initial example, as shown in the next picture: 


[Figure: the unbounded feasible region bounded by the lines $x_2 - x_1 = 1$, $x_1 = 0$, and $x_2 = 0$, with the point (0,0) and the direction (1,1) marked.]
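The unboundedness can also be seen by a one-line check of our own: for every $t \ge 0$ the point $(t, t)$ satisfies the three remaining constraints,

\[
x_1 = t \ge 0, \qquad x_2 = t \ge 0, \qquad x_2 - x_1 = 0 \le 1,
\]

while the objective value $x_1 + x_2 = 2t$ exceeds any prescribed bound once $t$ is large enough.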


Let us summarize: We have seen that a linear program can have one or 
infinitely many optimal solutions, but it may also be unbounded or infeasible. 
Later we will prove that no other situations can occur. 

We have solved the initial linear program graphically. It was easy since 
there are only two variables. However, for a linear program with four variables 
we won’t even be able to make a picture, let alone find an optimal solution 
graphically. A substantial linear program in practice often has several thou- 
sand variables, rather than two or four. A graphical illustration is useful for 
understanding the notions and procedures of linear programming, but as a 
computational method it is worthless. Sometimes it may even be mislead- 
ing, since objects in high dimension may behave in a way quite different from 
what the intuition gained in the plane or in three-dimensional space suggests. 

One of the key pieces of knowledge about linear programming that one 
should remember forever is this: 


A linear program is efficiently solvable, both in theory and in practice. 


• In practice, a number of software packages are available. They can handle
  inputs with several thousands of variables and constraints. Linear
  programs with a special structure, for example, with a small number of
  nonzero coefficients in each constraint, can often be managed even with
  a much larger number of variables and constraints.

• In theory, algorithms have been developed that provably solve each linear
  program in time bounded by a certain polynomial function of the input
  size. The input size is measured as the total number of bits needed to
  write down all coefficients in the objective function and in all the constraints.


These two statements summarize the results of long and strenuous research, 
and efficient methods for linear programming are not simple. 

In order that the above piece of knowledge will also make sense forever, 
one should not forget what a linear program is, so we repeat it once again: 
A linear program is the problem of maximizing a given linear function
over the set of all vectors that satisfy a given system of linear equations
and inequalities. Each linear program can easily be transformed to the
form


maximize $c^T x$ subject to $Ax \le b$.


1.2 What Can Be Found in This Book 


The rest of Chapter 1 briefly discusses the history and importance of linear 
programming and connects it to linear algebra. 

For a large majority of readers it can be expected that whenever they 
encounter linear programming in practice or in research, they will be using it 
as a black box. From this point of view Chapter 2 is crucial, since it describes 
a number of algorithmic problems that can be solved via linear programming. 

The closely related Chapter 3 discusses integer programming, in which 
one also optimizes a linear function over a set of vectors determined by linear 
constraints, but moreover, the variables must attain integer values. In this 
context we will see how linear programming can help in approximate solutions 
of hard computational problems. 

Chapter 4 brings basic theoretical results on linear programming and on 
the geometric structure of the set of all feasible solutions. Notions introduced 
there, such as convexity and convex polyhedra, are important in many other 
branches of mathematics and computer science as well. 

Chapter 5 covers the simplex method, which is a fundamental algorithm 
for linear programming. In full detail it is relatively complicated, and from 
the contemporary point of view it is not necessarily the central topic in a first 
course on linear programming. In contrast, some traditional introductions to 
linear programming are focused almost solely on the simplex method. 

In Chapter 6 we will state and prove the duality theorem, which is one 
of the principal theoretical results in linear programming and an extremely 
useful tool for proofs. 

Chapter 7 deals with two other important algorithmic approaches to linear 
programming: the ellipsoid method and the interior point method. Both of 
them are rather intricate and we omit some technical issues. 

Chapter 8 collects several slightly more advanced applications of linear 
programming from various fields, each with motivation and some background 
material. 

Chapter 9 contains remarks on software available for linear programming 
and on the literature. 

Linear algebra is the main mathematical tool throughout the book. The 
required linear-algebraic notions and results are summarized in an appendix. 

The book concludes with a glossary of terms that are common in linear 
programming but do not appear in the main text. Some of them are listed to
ensure that our index can compete with those of thicker books, and others
appear as background material for the advanced reader. 


Two levels of text. This book should serve mainly as an introductory text 
for undergraduate and early graduate students, and so we do not want to 
assume previous knowledge beyond the usual basic undergraduate courses. 
However, many of the key results in linear programming, which would be a 
pity to omit, are not easy to prove, and sometimes they use mathematical 
methods whose knowledge cannot be expected at the undergraduate level. 
Consequently, the text is divided into two levels. On the basic level we are 
aiming at full and sufficiently detailed proofs. 


The second, more advanced, and “edifying” level is typographically 
distinguished like this. In such parts, intended chiefly for mathemati- 
cally more mature readers, say graduate or PhD students, we include 
sketches of proofs and somewhat imprecise formulations of more ad- 
vanced results. Whoever finds these passages incomprehensible may 
freely ignore them; the basic text should also make sense without them. 


1.3 Linear Programming and Linear Algebra 


The basics of linear algebra can be regarded as a theory of systems of linear 
equations. Linear algebra considers many other things as well, but systems 
of linear equations are surely one of the core subjects. A key algorithm is 
Gaussian elimination, which efficiently finds a solution of such a system, and 
even a description of the set of all solutions. Geometrically, the solution set 
is an affine subspace of $\mathbb{R}^n$, which is an important linear-algebraic notion.²

In a similar spirit, the discipline of linear programming can be regarded 
as a theory of systems of linear inequalities. 


In a linear program this is somewhat obscured by the fact that we
do not look for an arbitrary solution of the given system of inequalities,
but rather a solution maximizing a given objective function. But it
can be shown that finding an (arbitrary) feasible solution of a linear
program, if one exists, is computationally almost as difficult as
finding an optimal solution. Let us outline how one can obtain an optimal
solution, provided that feasible solutions can be computed (a different
and more elegant way will be described in Section 6.1). If we somehow
know in advance that, for instance, the maximum value of the objective
function in a given linear program lies between 0 and 100, we can first
ask whether there exists a feasible $x \in \mathbb{R}^n$ for which the objective
function is at least 50. That is, we add to the existing constraints a
new constraint requiring that the value of the objective function be
at least 50, and we find out whether this auxiliary linear program
has a feasible solution. If yes, we will further ask, by the same trick,
whether the objective function can be at least 75, and if not, we will
check whether it can be at least 25. A reader with computer-science-conditioned
reflexes has probably already recognized the strategy of
binary search, which allows us to quickly localize the maximum value
of the objective function with great accuracy.


² An affine subspace is a linear subspace translated by a fixed vector $x \in \mathbb{R}^n$. For
example, every point, every line, and $\mathbb{R}^2$ itself are the affine subspaces of $\mathbb{R}^2$.
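The binary search strategy can be sketched in Python on the two-variable example from Section 1.1. Everything named here is an assumption made for illustration: `has_feasible_point` is a toy feasibility oracle that works only for two variables (it relies on the constraints $x_1 \ge 0$, $x_2 \ge 0$, which guarantee that a nonempty feasible region has a corner), and in a real setting it would be replaced by a call to an actual solver. The interval [0, 100] is the one used in the text.

```python
from fractions import Fraction
from itertools import combinations

# Example from Section 1.1; each constraint (a1, a2, b) means a1*x1 + a2*x2 <= b.
BASE = [(-1, 0, 0), (0, -1, 0), (-1, 1, 1), (1, 6, 15), (4, -1, 10)]
OBJECTIVE = (1, 1)


def has_feasible_point(constraints):
    """Toy feasibility oracle for two variables (stand-in for a real solver).

    With x1 >= 0 and x2 >= 0 among the constraints, the feasible region
    contains no line, so it is nonempty exactly when some intersection
    point of two boundary lines satisfies all constraints.
    """
    for (a1, a2, b1), (a3, a4, b2) in combinations(constraints, 2):
        det = a1 * a4 - a2 * a3
        if det == 0:
            continue
        x1 = Fraction(b1 * a4 - a2 * b2, det)
        x2 = Fraction(a1 * b2 - b1 * a3, det)
        if all(p * x1 + q * x2 <= r for p, q, r in constraints):
            return True
    return False


def objective_at_least(t):
    """Is there a feasible x with c^T x >= t?  (c^T x >= t  <=>  -c^T x <= -t)"""
    extra = (-OBJECTIVE[0], -OBJECTIVE[1], -t)
    return has_feasible_point(BASE + [extra])


def binary_search_optimum(lo, hi, tolerance):
    """Assumes the value lo is attainable and the optimum is at most hi."""
    while hi - lo > tolerance:
        mid = (lo + hi) / 2
        if objective_at_least(mid):
            lo = mid      # some feasible x reaches mid: optimum >= mid
        else:
            hi = mid      # nobody reaches mid: optimum < mid
    return lo, hi


lo, hi = binary_search_optimum(Fraction(0), Fraction(100), Fraction(1, 1000))
print("optimum lies between", float(lo), "and", float(hi))
```

Each round halves the interval, so localizing the optimum within 1/1000 of an initial interval of length 100 takes 17 feasibility queries.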


Geometrically, the set of all solutions of a system of linear inequalities
is an intersection of finitely many half-spaces in $\mathbb{R}^n$. Such a set is called a
convex polyhedron, and familiar examples of convex polyhedra in $\mathbb{R}^3$ are a
cube, a rectangular box, a tetrahedron, and a regular dodecahedron. Convex
polyhedra are mathematically much more complex objects than vector
subspaces or affine subspaces (we will return to this later). So actually, we
can be grateful for the objective function in a linear program: It is enough to
compute a single point $x^* \in \mathbb{R}^n$ as a solution and we need not worry about
the whole polyhedron.

In linear programming, a role comparable to that of Gaussian elimination 
in linear algebra is played by the simplex method. It is an algorithm for 
solving linear programs, usually quite efficient, and it also allows one to prove 
theoretical results. 

Let us summarize the analogies between linear algebra and linear pro- 
gramming in tabular form: 


                       Basic problem          Algorithm      Solution set

Linear algebra         system of              Gaussian       affine
                       linear equations       elimination    subspace

Linear programming     system of              simplex        convex
                       linear inequalities    method         polyhedron


1.4 Significance and History of Linear Programming 


In a special issue of the journal Computing in Science & Engineering, the
simplex method was included among “the ten algorithms with the greatest
influence on the development and practice of science and engineering in the
20th century.”³ Although some may argue that the simplex method is only
number fourteen, say, and although each such evaluation is necessarily subjective,
the importance of linear programming can hardly be cast in doubt.


³ The remaining nine algorithms on this list are the Metropolis algorithm for
Monte Carlo simulations, the Krylov subspace iteration methods, the decompositional
approach to matrix computations, the Fortran optimizing compiler, the
QR algorithm for computing eigenvalues, the Quicksort algorithm for sorting,
the fast Fourier transform, the detection of integer relations, and the fast multipole
method.

The simplex method was invented and developed by George Dantzig in 
1947, based on his work for the U.S. Air Force. Even earlier, in 1939, Leonid 
Vitalyevich Kantorovich was charged with the reorganization of the timber 
industry in the U.S.S.R., and as a part of his task he formulated a restricted 
class of linear programs and a method for their solution. As happens under 
such regimes, his discoveries went almost unnoticed and nobody continued his 
work. Kantorovich together with Tjalling Koopmans received the Nobel Prize 
in Economics in 1975, for pioneering work in resource allocation. Somewhat 
ironically, Dantzig, whose contribution to linear programming is no doubt 
much more significant, was never awarded a Nobel Prize. 

The discovery of the simplex method had a great impact on both the- 
ory and practice in economics. Linear programming was used to allocate 
resources, plan production, schedule workers, plan investment portfolios, and 
formulate marketing and military strategies. Even entrepreneurs and man- 
agers accustomed to relying on their experience and intuition were impressed 
when costs were cut by 20%, say, by a mere reorganization according to 
some mysterious calculation. Especially when such a feat was accomplished 
by someone who was not really familiar with the company, just on the basis 
of some numerical data. Suddenly, mathematical methods could no longer be 
ignored with impunity in a competitive environment. 

Linear programming has evolved a great deal since the 1940s, and new
types of applications have been found, by no means restricted to mathematical
economics.

In theoretical computer science it has become one of the fundamental tools 
in algorithm design. For a number of computational problems the existence 
of an efficient (polynomial-time) algorithm was first established by general 
techniques based on linear programming. 

For other problems, known to be computationally difficult (NP-hard, if 
this term tells the reader anything), finding an exact solution is often hope- 
less. One looks for approximate algorithms, and linear programming is a key 
component of the most powerful known methods. 

Another surprising application of linear programming is theoretical: the 
duality theorem, which will be explained in Chapter 6, appears in proofs of 
numerous mathematical statements, most notably in combinatorics, and it 
provides a unifying abstract view of many seemingly unrelated results. The 
duality theorem is also significant algorithmically. 

We will show examples of methods for constructing algorithms and proofs 
based on linear programming, but many other results of this kind are too 
advanced for a short introductory text like ours. 

The theory of algorithms for linear programming itself has also grown considerably.
As everybody knows, today’s computers are many orders of magnitude
faster than those of fifty years ago, and so it doesn’t sound surprising
that much larger linear programs can be solved today. But it may be surprising
that this enlargement of manageable problems probably owes more to
theoretical progress in algorithms than to faster computers. On the one hand, 
the implementation of the simplex method has been refined considerably, and 
on the other hand, new computational methods based on completely differ- 
ent ideas have been developed. This latter development will be described in 
Chapter 7. 


2. Examples 


Linear programming is a wonderful tool. But in order to use it, one first 
has to start suspecting that the considered computational problem might be 
expressible by a linear program, and then one has to really express it that 
way. In other words, one has to see linear programming “behind the scenes.” 

One of the main goals of this book is to help the reader acquire skills in 
this direction. We believe that this is best done by studying diverse examples 
and by practice. In this chapter we present several basic cases from the wide 
spectrum of problems amenable to linear programming methods, and we 
demonstrate a few tricks for reformulating problems that do not look like 
linear programs at first sight. Further examples are covered in Chapter 3, 
and Chapter 8 includes more advanced applications. 

Once we have a suitable linear programming formulation (a “model” in 
the mathematical programming parlance), we can employ general algorithms. 
From a programmer’s point of view this is very convenient, since it suffices to 
input the appropriate objective function and constraints into general-purpose 
software. 

If efficiency is a concern, this need not be the end of the story. Many prob- 
lems have special features, and sometimes specialized algorithms are known, 
or can be constructed, that solve such problems substantially faster than 
a general approach based on linear programming. For example, the study 
of network flows, which we consider in Section 2.2, constitutes an extensive 
subfield of theoretical computer science, and fairly efficient algorithms have 
been developed. Computing a maximum flow via linear programming is thus 
not the best approach for large-scale instances. 

However, even for problems where linear programming doesn’t ultimately 
yield the most efficient available algorithm, starting with a linear program- 
ming formulation makes sense: for fast prototyping, case studies, and deciding 
whether developing problem-specific software is worth the effort.


2.1 Optimized Diet: Wholesome and Cheap?


...and when Rabbit said, “Honey or condensed milk 
with your bread?” he was so excited that he said, 
“Both,” and then, so as not to seem greedy, he added, 
“But don’t bother about the bread, please.” 


A.A. Milne, Winnie the Pooh 


The Office of Nutrition Inspection of the EU recently found out that dishes 
served at the dining and beverage facility “Bullneck’s,” such as herring, hot 
dogs, and house-style hamburgers do not comport with the new nutritional 
regulations, and its report mentioned explicitly the lack of vitamins A and 
C and dietary fiber. The owner and operator of the aforementioned facility 
is attempting to rectify these shortcomings by augmenting the menu with 
vegetable side dishes, which he intends to create from white cabbage, carrots, 
and a stockpile of pickled cucumbers discovered in the cellar. The following 
table summarizes the numerical data: the prescribed amount of the vitamins 
and fiber per dish, their content in the foods, and the unit prices of the foods.¹


                        Carrot,   White Cabbage,   Cucumber,   Required
                        Raw       Raw              Pickled     per dish

Vitamin A [mg/kg]       35        0.5              0.5         0.5 mg
Vitamin C [mg/kg]       60        300              10          15 mg
Dietary Fiber [g/kg]    30        20               10          4 g
price [€/kg]            0.75      0.5              0.15ᵃ

ᵃ Residual accounting price of the inventory, most likely unsaleable.


At what minimum additional price per dish can the requirements of the 
Office of Nutrition Inspection be satisfied? This question can be expressed 
by the following linear program: 


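\[
\begin{array}{ll}
\text{Minimize} & 0.75x_1 + 0.5x_2 + 0.15x_3 \\
\text{subject to} & x_1 \ge 0 \\
 & x_2 \ge 0 \\
 & x_3 \ge 0 \\
 & 35x_1 + 0.5x_2 + 0.5x_3 \ge 0.5 \\
 & 60x_1 + 300x_2 + 10x_3 \ge 15 \\
 & 30x_1 + 20x_2 + 10x_3 \ge 4.
\end{array}
\]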


The variable $x_1$ specifies the amount of carrot (in kg) to be added to each dish,
and similarly for $x_2$ (cabbage) and $x_3$ (cucumber). The objective function
expresses the price of the combination. The amounts of carrot, cabbage, and
cucumber are always nonnegative, which is captured by the conditions $x_1 \ge 0$,
$x_2 \ge 0$, $x_3 \ge 0$ (if we didn’t include them, an optimal solution might perhaps
have the amount of carrot, say, negative, by which one would seemingly save
money). Finally, the inequalities in the last three lines enforce the requirements
on vitamins A and C and on dietary fiber.


¹ For those interested in healthy diet: The vitamin contents and other data are
more or less realistic.

The linear program can be solved by standard methods. The optimal 
solution yields the price of €0.07 with the following doses: carrot 9.5 g, 
cabbage 38 g, and pickled cucumber 290 g per dish (all rounded to two 
significant digits). This probably wouldn’t pass another round of inspection. 
In reality one would have to add further constraints, for example, one on the 
maximum amount of pickled cucumber. 
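The stated optimum can be reproduced without any optimization library. The sketch below is our own brute-force check, not one of the "standard methods" the text refers to: it assumes that the minimum is attained at a corner of the feasible region (reasonable here, since all variables are nonnegative and all prices are positive), makes every choice of three constraints tight, solves the resulting 3×3 system exactly, and keeps the cheapest feasible point. The helper names `solve_square` and `cheapest_vertex` are hypothetical.

```python
from fractions import Fraction as F
from itertools import combinations

COST = [F("0.75"), F("0.5"), F("0.15")]   # price per kg: carrot, cabbage, cucumber

# Every constraint in the form a^T x >= b.
CONSTRAINTS = [
    ([F(1), F(0), F(0)], F(0)),                  # x1 >= 0
    ([F(0), F(1), F(0)], F(0)),                  # x2 >= 0
    ([F(0), F(0), F(1)], F(0)),                  # x3 >= 0
    ([F(35), F("0.5"), F("0.5")], F("0.5")),     # vitamin A
    ([F(60), F(300), F(10)], F(15)),             # vitamin C
    ([F(30), F(20), F(10)], F(4)),               # dietary fiber
]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve_square(rows, rhs):
    """Solve a square linear system by Gauss-Jordan elimination.

    Returns None if the system is singular.
    """
    n = len(rows)
    m = [list(row) + [r] for row, r in zip(rows, rhs)]
    for col in range(n):
        pivot = next((i for i in range(col, n) if m[i][col] != 0), None)
        if pivot is None:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        for i in range(n):
            if i != col and m[i][col] != 0:
                factor = m[i][col]
                m[i] = [vi - factor * vc for vi, vc in zip(m[i], m[col])]
    return [m[i][n] for i in range(n)]


def cheapest_vertex():
    """Try every triple of tight constraints; return (cost, x) of the best one."""
    best = None
    for tight in combinations(CONSTRAINTS, 3):
        x = solve_square([a for a, _ in tight], [b for _, b in tight])
        if x is None or not all(dot(a, x) >= b for a, b in CONSTRAINTS):
            continue
        cost = dot(COST, x)
        if best is None or cost < best[0]:
            best = (cost, x)
    return best


cost, x = cheapest_vertex()
print("price per dish [EUR]:", round(float(cost), 4))
print("grams per dish (carrot, cabbage, cucumber):",
      [round(float(1000 * v), 1) for v in x])
```

At the point found, all three nutrient constraints hold with equality, which agrees with the rounded doses quoted above.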

We have included this example so that our treatment doesn’t look too 
revolutionary. It seems that all introductions to linear programming begin 
with various dietary problems, most likely because the first large-scale prob- 
lem on which the simplex method was tested in 1947 was the determination 
of an adequate diet of least cost: Which foods should be combined and in
what amounts so that the required amounts of all essential nutrients are satisfied
and the daily ration is the cheapest possible? The linear program had
77 variables and 9 constraints, and its solution by the simplex method using 
hand-operated desk calculators took approximately 120 man-days. 

Later on, when George Dantzig had already gained access to an electronic 
computer, he tried to optimize his own diet as well. The optimal solution of 
the first linear program that he constructed recommended daily consumption 
of several liters of vinegar. When he removed vinegar from the next input, 
he obtained approximately 200 bouillon cubes as the basis of the daily diet. 
This story, whose truth is not entirely out of the question, doesn’t diminish 
the power of linear programming in any way, but it illustrates how difficult it 
is to capture mathematically all the important aspects of real-life problems. 
In the realm of nutrition, for example, it is not clear even today what exactly 
the influence of various components of food is on the human body. (Although, 
of course, many things are clear, and hopes that the science of the future will 
recommend hamburgers as the main ingredient of a healthy diet will almost 
surely be disappointed.) Even if it were known perfectly, few people are willing
and able to formulate exactly what they expect from their diet; apparently,
it is much easier to formulate such requirements for the diet of someone 
else. Moreover, there are nonlinear dependencies among the effects of various 
nutrients, and so the dietary problem can never be captured perfectly by 
linear programming. 


There are many applications of linear programming in industry, agricul- 
ture, services, etc. that from an abstract point of view are variations of the 
diet problem and do not introduce substantially new mathematical tricks. 
It may still be challenging to design good models for real-life problems of 
this kind, but the challenges are not mathematical. We will not dwell on
such problems here (many examples can be found in Chvátal’s book cited in
Chapter 9), and we will present problems in which the use of linear programming
has different flavors.


2.2 Flow in a Network 


An administrator of a computer network convinced his employer to purchase 
a new computer with an improved sound system. He wants to transfer his 
music collection from an old computer to the new one, using a local network. 
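The network looks like this:


[Figure: a network on the nodes o, a, b, c, d, e, n. Its links and their maximum transfer rates, as they appear in the linear program set up for this network, are oa: 3, ob: 1, oc: 1, ab: 1, ad: 1, be: 3, cd: 4, ce: 4, dn: 4, en: 1.]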


What is the maximum transfer rate from computer o (old) to computer n 
(new)? The numbers near each data link specify the maximum transfer rate 
of that link (in Mbit/s, say). We assume that each link can transfer data in 
either direction, but not in both directions simultaneously. So, for example, 
through the link ab one can either send data from a to b at any rate from 0 
up to 1 Mbit/s, or send data from b to a at any rate from 0 to 1 Mbit/s. 

The nodes a, b,...,e are not suitable for storing substantial amounts of 
data, and hence all data entering them has to be sent further immediately. 
From this we can already see that the maximum transfer rate cannot be used 
on all links simultaneously (consider node a, for example). Thus we have to 
find an appropriate value of the data flow for each link so that the total 
transfer rate from o to n is maximum. 

For every link in the network we introduce one variable. For example, $x_{be}$
specifies the rate by which data is transferred from b to e. Here $x_{be}$ can also be
negative, which means that data flows in the opposite direction, from e to b.
(And we thus do not introduce another variable $x_{eb}$, which would correspond
to the transfer rate from e to b.) There are 10 variables: $x_{oa}$, $x_{ob}$, $x_{oc}$, $x_{ab}$,
$x_{ad}$, $x_{be}$, $x_{cd}$, $x_{ce}$, $x_{dn}$, and $x_{en}$.

We set up the following linear program: 
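\[
\begin{array}{ll}
\text{Maximize}   & x_{oa} + x_{ob} + x_{oc} \\
\text{subject to} & -3 \le x_{oa} \le 3, \quad -1 \le x_{ob} \le 1, \quad -1 \le x_{oc} \le 1 \\
                  & -1 \le x_{ab} \le 1, \quad -1 \le x_{ad} \le 1, \quad -3 \le x_{be} \le 3 \\
                  & -4 \le x_{cd} \le 4, \quad -4 \le x_{ce} \le 4, \quad -4 \le x_{dn} \le 4 \\
                  & -1 \le x_{en} \le 1 \\
                  & x_{oa} = x_{ab} + x_{ad} \\
                  & x_{ob} + x_{ab} = x_{be} \\
                  & x_{oc} = x_{cd} + x_{ce} \\
                  & x_{ad} + x_{cd} = x_{dn} \\
                  & x_{be} + x_{ce} = x_{en}.
\end{array}
\]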


The objective function $x_{oa} + x_{ob} + x_{oc}$ expresses the total rate at
which data is sent out from computer o. Since it is neither stored nor lost
(hopefully) anywhere, it has to be received at n at the same rate. The next
10 constraints, $-3 \le x_{oa} \le 3$ through $-1 \le x_{en} \le 1$, restrict
the transfer rates along the individual links. The remaining constraints say
that whatever enters each of the nodes a through e has to leave immediately.

The optimal solution of this linear program is depicted below: 


The number near each link is the transfer rate on that link, and the arrow
determines the direction of the data flow. Note that between c and e data has
to be sent in the direction from e to c, and hence $x_{ce} = -1$. The optimum
value of the objective function is 4, and this is the desired maximum transfer
rate.

In this example it is easy to see that the transfer rate cannot be larger, 
since the total capacity of all links connecting the computers o and a to the 
rest of the network equals 4. This is a special case of a remarkable theorem 
on maximum flow and minimum cut, which is usually discussed in courses on 
graph algorithms (see also Section 8.2). 
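
The optimality argument above can be replayed mechanically. The following Python sketch checks one flow of value 4 against the constraints of the linear program and computes the capacity of the links leaving the set {o, a}. The capacities are read off the constraints; the particular flow is an assumption (worked out by hand, not taken from the figure), and the function names are illustrative only.

```python
# Link capacities, read off the bound constraints of the linear program.
CAPACITY = {
    ("o", "a"): 3, ("o", "b"): 1, ("o", "c"): 1,
    ("a", "b"): 1, ("a", "d"): 1, ("b", "e"): 3,
    ("c", "d"): 4, ("c", "e"): 4, ("d", "n"): 4,
    ("e", "n"): 1,
}

# Candidate flow (assumption: derived by hand, not necessarily the one drawn
# in the book). A negative value means data flows against the key's orientation.
FLOW = {
    ("o", "a"): 2, ("o", "b"): 1, ("o", "c"): 1,
    ("a", "b"): 1, ("a", "d"): 1, ("b", "e"): 2,
    ("c", "d"): 2, ("c", "e"): -1, ("d", "n"): 3,
    ("e", "n"): 1,
}


def is_feasible(flow, capacity):
    """Check the capacity bounds and the conservation equations at a, ..., e."""
    within_bounds = all(abs(flow[link]) <= capacity[link] for link in capacity)
    conserved = True
    for node in "abcde":
        inflow = sum(f for (u, v), f in flow.items() if v == node)
        outflow = sum(f for (u, v), f in flow.items() if u == node)
        conserved = conserved and inflow == outflow
    return within_bounds and conserved


def flow_value(flow):
    """Total rate sent out of computer o (the objective function)."""
    return sum(f for (u, v), f in flow.items() if u == "o")


def cut_capacity(side, capacity):
    """Total capacity of the links with exactly one endpoint in `side`."""
    return sum(c for (u, v), c in capacity.items() if (u in side) != (v in side))


print(is_feasible(FLOW, CAPACITY), flow_value(FLOW), cut_capacity({"o", "a"}, CAPACITY))
```

A feasible flow whose value equals the capacity of some cut cannot be improved, which is exactly the reasoning used in the text.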

Our example of data flow in a network is small and simple. In practice, 
however, flows are considered in intricate networks, sometimes even with 
many source nodes and sink nodes. These can be electrical networks (current 
flows), road or railroad networks (cars or trains flow), telephone networks 
(voice or data signals flow), financial networks (money flows), and so on. There are
also many less-obvious applications of network flows—for example, in image 
processing. 
Historically, the network flow problem was first formulated by
American military experts in search of efficient ways of disrupting
the railway system of the Soviet bloc; see

A. Schrijver: On the history of the transportation and maximum
flow problems, Math. Programming Ser. B 91 (2002) 437-445.


2.3 Ice Cream All Year Round 


The next application of linear programming again concerns food (which
should not be surprising, given the importance of food in life and the
difficulties in optimizing sleep or love). The ice cream manufacturer Icicle
Works Ltd.* needs to set up a production plan for the next year. Based on
history, extensive surveys, and bird observations, the marketing department
has come up with the following prediction of monthly sales of ice cream in
the next year:

[Figure: predicted monthly sales in tons (vertical axis from 100 to 700)
for the months Jan through Dec.]

Now Icicle Works Ltd. needs to set up a production schedule to meet
these demands.

A simple solution would be to produce “just in time,” meaning that all
the ice cream needed in month i is also produced in month i, i = 1,2,...,12.
However, this means that the produced amount would vary greatly from
month to month, and a change in the produced amount has significant costs:
Temporary workers have to be hired or laid off, machines have to be adjusted,
and so on. So it would be better to spread the production more evenly over
the year: In months with low demand, the idle capacities of the factory could
be used to build up a stock of ice cream for the months with high demand.

* Not to be confused with a rock group of the same name. The name comes from
a nice science fiction story by Frederik Pohl.

So another simple solution might be a completely “flat” production sched- 
ule, with the same amount produced every month. Some thought reveals that 
such a schedule need not be feasible if we want to end up with zero surplus 
at the end of the year. But even if it is feasible, it need not be ideal either, 
since storing ice cream incurs a nontrivial cost. It seems likely that the best 
production schedule should be somewhere between these two extremes (pro- 
duction following demand and constant production). We want a compromise 
minimizing the total cost resulting both from changes in production and from 
storage of surpluses. 

To formalize this problem, let us denote the demand in month i by $d_i \ge 0$
(in tons). Then we introduce a nonnegative variable $x_i$ for the production in
month i and another nonnegative variable $s_i$ for the total surplus in store
at the end of month i. To meet the demand in month i, we may use the
production in month i and the surplus at the end of month $i-1$:

\[ x_i + s_{i-1} \ge d_i \quad \text{for } i = 1,2,\ldots,12. \]

The quantity $x_i + s_{i-1} - d_i$ is exactly the surplus after month i, and
thus we have

\[ x_i + s_{i-1} - s_i = d_i \quad \text{for } i = 1,2,\ldots,12. \]

Assuming that initially there is no surplus, we set $s_0 = 0$ (if we took the
production history into account, $s_0$ would be the surplus at the end of the
previous year). We also set $s_{12} = 0$, unless we want to plan for another year.

Among all nonnegative solutions to these equations, we are looking for one
that minimizes the total cost. Let us assume that changing the production
by 1 ton from month $i-1$ to month i costs €50, and that storage facilities
for 1 ton of ice cream cost €20 per month. Then the total cost is expressed
by the function

\[ 50 \sum_{i=1}^{12} |x_i - x_{i-1}| + 20 \sum_{i=1}^{12} s_i, \]

where we set $x_0 = 0$ (again, history can easily be taken into account).

Unfortunately, this cost function is not linear. Fortunately, there is a
simple but important trick that allows us to make it linear, at the price of
introducing extra variables.

The change in production is either an increase or a decrease. Let us
introduce a nonnegative variable $y_i$ for the increase from month $i-1$ to
month i, and a nonnegative variable $z_i$ for the decrease. Then

\[ x_i - x_{i-1} = y_i - z_i \quad \text{and} \quad |x_i - x_{i-1}| = y_i + z_i. \]

A production schedule of minimum total cost is given by an optimal
solution of the following linear program:

\[
\begin{array}{lll}
\text{Minimize}   & 50 \sum_{i=1}^{12} y_i + 50 \sum_{i=1}^{12} z_i + 20 \sum_{i=1}^{12} s_i \\
\text{subject to} & x_i + s_{i-1} - s_i = d_i & \text{for } i = 1,2,\ldots,12 \\
                  & x_i - x_{i-1} = y_i - z_i & \text{for } i = 1,2,\ldots,12 \\
                  & x_0 = 0 \\
                  & s_0 = 0 \\
                  & s_{12} = 0 \\
                  & x_i, s_i, y_i, z_i \ge 0 & \text{for } i = 1,2,\ldots,12.
\end{array}
\]


To see that an optimal solution $(x^*, s^*, y^*, z^*)$ of this linear program
indeed defines a schedule, we need to note that one of $y_i^*$ and $z_i^*$ has
to be zero for all i, for otherwise, we could decrease both by the same amount
and obtain a better solution: the difference $y_i^* - z_i^*$ stays the same,
so all constraints remain satisfied, while the cost drops. This means that
$y_i^* + z_i^*$ indeed equals the change in production from month $i-1$
to month i, as required.
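
To make the bookkeeping concrete, here is a minimal Python sketch that evaluates the cost of a given production schedule exactly as the model prescribes, splitting each change in production into an increase and a decrease of which at most one is nonzero. It only evaluates a schedule; it does not solve the linear program. The four-month demand is a hypothetical example (the demands from the chart are not used), and the function name is illustrative.

```python
def schedule_cost(demand, production, change_cost=50, storage_cost=20):
    """Return (cost, surplus, increases, decreases) for a production schedule.

    Uses the conventions of the text: x_0 = 0 and s_0 = 0. Raises ValueError
    if some demand is not met or a surplus remains after the last month.
    """
    surplus, increases, decreases = [], [], []
    prev_x, prev_s = 0, 0
    for d, x in zip(demand, production):
        s = x + prev_s - d                  # s_i = x_i + s_{i-1} - d_i
        if s < 0:
            raise ValueError("demand not met")
        change = x - prev_x                 # x_i - x_{i-1} = y_i - z_i
        increases.append(max(change, 0))    # y_i (zero whenever z_i > 0)
        decreases.append(max(-change, 0))   # z_i (zero whenever y_i > 0)
        surplus.append(s)
        prev_x, prev_s = x, s
    if surplus[-1] != 0:
        raise ValueError("final surplus must be zero")
    cost = (change_cost * (sum(increases) + sum(decreases))
            + storage_cost * sum(surplus))
    return cost, surplus, increases, decreases


# Hypothetical demand in tons for four months.
demand = [100, 200, 100, 200]
just_in_time_cost = schedule_cost(demand, [100, 200, 100, 200])[0]
flat_cost = schedule_cost(demand, [150, 150, 150, 150])[0]
print(just_in_time_cost, flat_cost)
```

For this toy demand the flat schedule happens to be cheaper than production following demand; with other cost coefficients the comparison can go the other way, which is why the compromise is left to the linear program.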

In the Icicle Works example above, this linear program yields the follow- 
ing production schedule (shown with black bars; the gray background graph 
represents the demands). 


[Figure: monthly production in tons (black bars, vertical axis from 100
to 700) drawn over the monthly demands (gray background).]

Below is the schedule we would get with zero storage costs (that is, after 
replacing the “20” by “0” in the above linear program). 


[Figure: monthly production in tons with zero storage costs (vertical
axis from 100 to 700).]

The pattern of this example is quite general, and many problems of optimal
control can be solved via linear programming in a similar manner. A neat
example is “Moon Rocket Landing,” a once-popular game for programmable 
calculators (probably not sophisticated enough to survive in today’s compe- 
tition). A lunar module with limited fuel supply is descending vertically to 
the lunar surface under the influence of gravitation, and at chosen time inter- 
vals it can flash its rockets to slow down the descent (or even to start flying 
upward). The goal is to land on the surface with (almost) zero speed before 
exhausting all of the fuel. The reader is invited to formulate an appropriate 
linear program for determining the minimum amount of fuel necessary for 
landing, given the appropriate input data. For the linear programming for- 
mulation, we have to discretize time first (in the game this was done anyway), 
but with short enough time steps this doesn’t make a difference in practice. 

Let us remark that this particular problem can be solved analytically, with 
some calculus (or even mathematical control theory). But in even slightly 
more complicated situations, an analytic solution is out of reach. 


2.4 Fitting a Line 


The loudness level of nightingale singing was measured every evening for a 
number of days in a row, and the percentage of people watching the principal 
TV news was surveyed by questionnaires. The following diagram plots the 
measured values by points in the plane: 


[Figure: scatter plot of the measured points; horizontal axis: loudness
level [dB], vertical axis: TV watchers [%].]


The simplest dependencies are linear, and many dependencies can be well 
approximated by a linear function. We thus want to find a line that best fits 
the measured points. (Readers feeling that this example is not sufficiently 
realistic can recall some measurements in physics labs, where the measured 
quantities should actually obey an exact linear dependence.) 
How can one formulate mathematically that a given line “best fits” the
points? There is no unique way, and several different criteria are commonly 
used for line fitting in practice. 

The most popular one is the method of least squares, which for given
points $(x_1, y_1), \ldots, (x_n, y_n)$ seeks a line with equation
$y = ax + b$ minimizing the expression

\[ \sum_{i=1}^{n} (a x_i + b - y_i)^2. \tag{2.1} \]

In words, for every point we take its vertical distance from the line, square
it, and sum these “squares of errors.”

This method need not always be the most suitable. For instance, if a few
exceptional points are measured with very large error, they can influence the
resulting line a great deal. An alternative method, less sensitive to a small
number of “outliers,” is to minimize the sum of absolute values of all errors:

\[ \sum_{i=1}^{n} |a x_i + b - y_i|. \tag{2.2} \]

By a trick similar to the one we have seen in Section 2.3, this apparently 
nonlinear optimization problem can be captured by a linear program: 


\[
\begin{array}{lll}
\text{Minimize}   & e_1 + e_2 + \cdots + e_n \\
\text{subject to} & e_i \ge a x_i + b - y_i & \text{for } i = 1,2,\ldots,n \\
                  & e_i \ge -(a x_i + b - y_i) & \text{for } i = 1,2,\ldots,n.
\end{array}
\]

The variables are a, b, and $e_1, e_2, \ldots, e_n$ (while $x_1, \ldots, x_n$
and $y_1, \ldots, y_n$ are given numbers). Each $e_i$ is an auxiliary variable
standing for the error at the ith point. The constraints guarantee that

\[ e_i \ge \max\bigl(a x_i + b - y_i,\; -(a x_i + b - y_i)\bigr) = |a x_i + b - y_i|. \]

In an optimal solution each of these inequalities has to be satisfied with
equality, for otherwise, we could decrease the corresponding $e_i$. Thus, an
optimal solution yields a line minimizing the expression (2.2).
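
For small data sets the minimizer of (2.2) can also be found without a linear programming solver. The Python sketch below swaps in a brute-force search: it relies on the standard fact, not proved in the text, that if the points do not all share one x-coordinate, then some line minimizing (2.2) passes through two of the data points. The data set is hypothetical and the function names are illustrative.

```python
from itertools import combinations


def l1_error(a, b, points):
    """Sum of absolute vertical errors of the line y = a*x + b, as in (2.2)."""
    return sum(abs(a * x + b - y) for x, y in points)


def fit_line_l1(points):
    """Return (a, b) minimizing (2.2), trying every line through two points.

    Brute force with O(n^3) work; a substitute for solving the linear
    program, suitable only for small n.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue                      # vertical lines are not of the form y = ax + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        err = l1_error(a, b, points)
        if best is None or err < best[0]:
            best = (err, a, b)
    if best is None:
        raise ValueError("need two points with distinct x-coordinates")
    return best[1], best[2]


# Hypothetical measurements: four points on the line y = x and one outlier.
points = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 20)]
a, b = fit_line_l1(points)
print(a, b, l1_error(a, b, points))
```

On this data the fitted line ignores the outlier entirely, which illustrates the robustness claimed for criterion (2.2).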

The following picture shows a line fitted by this method (solid) and a line 
fitted using least squares (dotted): 
In conclusion, let us recall the useful trick we have learned here and in
the previous section:

Objective functions or constraints involving absolute values can often be
handled via linear programming by introducing extra variables or extra
constraints.


2.5 Separation of Points 


A computer-controlled rabbit trap “Gromit RT 2.1” should be programmed 
so that it catches rabbits, but if a weasel wanders in, it is released. The trap 
can weigh the animal inside and also can determine the area of its shadow. 




These two parameters were collected for a number of specimens of rabbits 
and weasels, as depicted in the following graph: 


[Figure: scatter plot of the specimens; horizontal axis: shadow area,
vertical axis: weight.]


(empty circles represent rabbits and full circles weasels). 

Apparently, neither weight alone nor shadow area alone can be used to
tell a rabbit from a weasel. One of the next-simplest things would be a
linear criterion distinguishing them. That is, geometrically, we would like
to separate the black points from the white points by a straight line if
possible. Mathematically speaking, we are given m white points
$p_1, p_2, \ldots, p_m$ and n black points $q_1, q_2, \ldots, q_n$ in the
plane, and we would like to find out whether there exists a line having all
white points on one side and all black points on the other side (none of the
points should lie on the line).

In a solution of this problem by linear programming we distinguish three 
cases. First we test whether there exists a vertical line with the required prop- 
erty. This case needs neither linear programming nor particular cleverness. 

The next case is the existence of a line that is not vertical and that has all
black points below it and all white points above it. Let us write the equation
of such a line as $y = ax + b$, where a and b are some yet unknown real
numbers. A point r with coordinates x(r) and y(r) lies above this line if
$y(r) > a\,x(r) + b$, and it lies below it if $y(r) < a\,x(r) + b$. So a
suitable line exists if and only if the following system of inequalities with
variables a and b has a solution:

\[
\begin{array}{ll}
y(p_i) > a\,x(p_i) + b & \text{for } i = 1,2,\ldots,m \\
y(q_j) < a\,x(q_j) + b & \text{for } j = 1,2,\ldots,n.
\end{array}
\]


We haven’t yet mentioned strict inequalities in connection with linear
programming, and actually, they are not allowed in linear programs. But here
we can get around this issue by a small trick: We introduce a new variable
$\delta$, which stands for the “gap” between the left and right sides of each
strict inequality. Then we try to make the gap as large as possible:

\[
\begin{array}{lll}
\text{Maximize}   & \delta \\
\text{subject to} & y(p_i) \ge a\,x(p_i) + b + \delta & \text{for } i = 1,2,\ldots,m \\
                  & y(q_j) \le a\,x(q_j) + b - \delta & \text{for } j = 1,2,\ldots,n.
\end{array}
\]

[Figure: the line $y = ax + b$, with the white points in the region
$y \ge ax + b + \delta$ and the black points in the region
$y \le ax + b - \delta$.]

This linear program has three variables: a, b, and $\delta$. The optimal
$\delta$ is positive exactly if the preceding system of strict inequalities
has a solution, and the latter happens exactly if a nonvertical line exists
with all black points below and all white points above.
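
The role of the gap variable is easy to see once a and b are fixed: the largest admissible gap is then simply the smallest slack among all constraints. The following Python sketch computes this value for a given line; it evaluates the objective for one candidate line and does not search for the best one. The sample points are hypothetical, and the function name is illustrative.

```python
def vertical_gap(a, b, white, black):
    """Largest delta such that (a, b, delta) satisfies all constraints.

    White points must satisfy y >= a*x + b + delta, black points
    y <= a*x + b - delta. A positive result means that the line
    y = a*x + b has all white points above it and all black points below it.
    """
    slack_white = min(y - (a * x + b) for x, y in white)
    slack_black = min((a * x + b) - y for x, y in black)
    return min(slack_white, slack_black)


# Hypothetical measurements (first coordinate x, second coordinate y).
white = [(0, 3), (2, 4), (4, 6)]
black = [(1, 0), (3, 2), (5, 3)]

print(vertical_gap(0.5, 1.5, white, black))   # a separating line
print(vertical_gap(0.0, 3.5, white, black))   # a line that fails to separate
```

The linear program performs the same computation for all pairs (a, b) at once by treating the gap as a third variable.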
Similarly, we can deal with the third case, namely the existence of a
nonvertical line having all black points above it and all white points below
it. This completes the description of an algorithm for the line separation
problem.


A plane separating two point sets in $\mathbb{R}^3$ can be computed by the
same approach, and we can also solve the analogous problem in higher
dimensions. So we could try to distinguish rabbits from weasels based
on more than two measured parameters.

Here is another, perhaps more surprising, extension. Let us imagine
that separating rabbits from weasels by a straight line proved impossible.
Then we could try, for instance, separating them by a graph of
a quadratic function (a parabola), of the form $ax^2 + bx + c$. So given
m white points $p_1, p_2, \ldots, p_m$ and n black points
$q_1, q_2, \ldots, q_n$ in the plane, we now ask, are there coefficients
$a, b, c \in \mathbb{R}$ such that the graph of $f(x) = ax^2 + bx + c$ has
all white points above it and all black points below? This leads to the
inequality system

\[
\begin{array}{ll}
y(p_i) > a\,x(p_i)^2 + b\,x(p_i) + c & \text{for } i = 1,2,\ldots,m \\
y(q_j) < a\,x(q_j)^2 + b\,x(q_j) + c & \text{for } j = 1,2,\ldots,n.
\end{array}
\]

By introducing a gap variable $\delta$ as before, this can be written as the
following linear program in the variables a, b, c, and $\delta$:

\[
\begin{array}{lll}
\text{Maximize}   & \delta \\
\text{subject to} & y(p_i) \ge a\,x(p_i)^2 + b\,x(p_i) + c + \delta & \text{for } i = 1,2,\ldots,m \\
                  & y(q_j) \le a\,x(q_j)^2 + b\,x(q_j) + c - \delta & \text{for } j = 1,2,\ldots,n.
\end{array}
\]

In this linear program the quadratic terms are coefficients and therefore
they cause no harm.

The same approach also allows us to test whether two point sets in
the plane, or in higher dimensions, can be separated by a function of
the form $f(x) = a_1\varphi_1(x) + a_2\varphi_2(x) + \cdots + a_k\varphi_k(x)$,
where $\varphi_1, \ldots, \varphi_k$ are given functions (possibly nonlinear)
and $a_1, a_2, \ldots, a_k$ are real coefficients, in the sense that
$f(p_i) > 0$ for every white point $p_i$ and $f(q_j) < 0$ for every black
point $q_j$.
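
That the quadratic terms end up as mere coefficients can be seen by writing out the constraint rows explicitly. The Python sketch below builds, for hypothetical data, the rows of the linear program for separation by a parabola, each in the form "coefficients times (a, b, c, delta) is at most a right-hand side", and tests candidate solutions against them. The data, the row layout, and the function names are assumptions made for illustration.

```python
def constraint_rows(white, black):
    """Rows (coeffs, rhs) meaning coeffs . (a, b, c, delta) <= rhs."""
    rows = []
    for x, y in white:      # y >= a*x^2 + b*x + c + delta
        rows.append(((x * x, x, 1, 1), y))
    for x, y in black:      # y <= a*x^2 + b*x + c - delta
        rows.append(((-x * x, -x, -1, 1), -y))
    return rows


def satisfies(rows, candidate, tol=1e-9):
    """Check whether candidate = (a, b, c, delta) satisfies every row."""
    return all(
        sum(k * v for k, v in zip(coeffs, candidate)) <= rhs + tol
        for coeffs, rhs in rows
    )


# Hypothetical data: the white point lies inside the triangle spanned by the
# black points, so no straight line can separate the two colors.
white = [(0, 0)]
black = [(-2, 3), (2, 3), (0, -1)]
rows = constraint_rows(white, black)

print(satisfies(rows, (1, 0, -0.5, 0.5)))   # parabola y = x^2 - 0.5 with gap 0.5
print(satisfies(rows, (0, 0, 0, 0.5)))      # horizontal line y = 0 with gap 0.5
```

Every row is linear in the unknowns even though the data enter through $x^2$; replacing the tuple `(x * x, x, 1)` by the values of other given functions yields the general case described above.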


2.6 Largest Disk in a Convex Polygon 


Here we will encounter another problem that may look nonlinear at first 
sight but can be transformed to a linear program. It is a simple instance of a 
geometric packing problem: Given a container, in our case a convex polygon, 
we want to fit as large an object as possible into it, in our case a disk of the 
largest possible radius. 
Let us call the given convex polygon P, and let us assume that it has
n sides. As we said, we want to find the largest circular disk contained in P.

For simplicity let us assume that none of the sides of P is vertical. Let the
ith side of P lie on a line $\ell_i$ with equation $y = a_i x + b_i$,
$i = 1,2,\ldots,n$, and let us choose the numbering of the sides in such a
way that the first, second, up to the kth side bound P from below, while the
(k + 1)st through nth side bound it from above.


Let us now ask, under what conditions does a circle with center
$s = (s_1, s_2)$ and radius r lie completely inside P? This is the case if and
only if the point s has distance at least r from each of the lines
$\ell_1, \ldots, \ell_n$, lies above the lines $\ell_1, \ldots, \ell_k$, and
lies below the lines $\ell_{k+1}, \ldots, \ell_n$. We compute the distance of
s from $\ell_i$. A simple calculation using similarity of triangles and the
Pythagorean theorem shows that this distance equals the absolute value of the
expression

\[ \frac{s_2 - a_i s_1 - b_i}{\sqrt{a_i^2 + 1}}. \]

Moreover, the expression is positive if s lies above $\ell_i$, and it is
negative if s lies below $\ell_i$:

[Figure: the point $(s_1, s_2)$ and the point $(s_1, a_i s_1 + b_i)$ on the
line $\ell_i$ with the same first coordinate.]


The disk of radius r centered at s thus lies inside P exactly if the following
system of inequalities is satisfied:

\[
\begin{array}{ll}
\dfrac{s_2 - a_i s_1 - b_i}{\sqrt{a_i^2 + 1}} \ge r, & i = 1,2,\ldots,k \\[2ex]
\dfrac{s_2 - a_i s_1 - b_i}{\sqrt{a_i^2 + 1}} \le -r, & i = k+1,k+2,\ldots,n.
\end{array}
\]

Therefore, we want to find the largest r such that there exist $s_1$ and $s_2$
so that all the constraints are satisfied. This yields a linear program! (Some
might be frightened by the square roots, but these can be computed in advance,
since all the $a_i$ are concrete numbers.)

\[
\begin{array}{lll}
\text{Maximize}   & r \\
\text{subject to} & \dfrac{s_2 - a_i s_1 - b_i}{\sqrt{a_i^2 + 1}} \ge r & \text{for } i = 1,2,\ldots,k \\[2ex]
                  & \dfrac{s_2 - a_i s_1 - b_i}{\sqrt{a_i^2 + 1}} \le -r & \text{for } i = k+1,k+2,\ldots,n.
\end{array}
\]

There are three variables: $s_1$, $s_2$, and r. An optimal solution yields the
desired largest disk contained in P.
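
As a small illustration of these constraints, the following Python sketch computes, for a fixed center s, the largest radius r that satisfies all of them, namely the smallest signed distance to a side. It does not optimize over the center; that is what the linear program adds. The polygon (a square standing on one corner, with vertices (0, -1), (1, 0), (0, 1), (-1, 0)) is a hypothetical example, and the function name is illustrative.

```python
from math import sqrt


def largest_radius_at(center, lower_sides, upper_sides):
    """Largest r for which the disk around `center` satisfies all constraints.

    Each side is a pair (a_i, b_i) describing the line y = a_i*x + b_i.
    Lower sides require signed distance >= r, upper sides <= -r.
    A negative result means that the center lies outside the polygon.
    """
    s1, s2 = center

    def signed_distance(a, b):
        return (s2 - a * s1 - b) / sqrt(a * a + 1)

    lower = [signed_distance(a, b) for a, b in lower_sides]
    upper = [-signed_distance(a, b) for a, b in upper_sides]
    return min(lower + upper)


# Hypothetical polygon: sides y = x - 1 and y = -x - 1 bound it from below,
# sides y = -x + 1 and y = x + 1 bound it from above.
lower_sides = [(1, -1), (-1, -1)]
upper_sides = [(-1, 1), (1, 1)]

r_center = largest_radius_at((0, 0), lower_sides, upper_sides)
r_shifted = largest_radius_at((0.5, 0), lower_sides, upper_sides)
print(r_center, r_shifted)
```

For this polygon the center (0, 0) admits a larger radius than the shifted center (0.5, 0), as one expects from symmetry.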


A similar problem in higher dimension can be solved analogously. For 
example, in three-dimensional space we can ask for the largest ball that can 
be placed into the intersection of n given half-spaces. 

Interestingly, another similar-looking problem, namely, finding the small- 
est disk containing a given convex n-gon in the plane, cannot be expressed 
by a linear program and has to be solved differently; see Section 8.7. 

Both in practice and in theory, one usually encounters geometric packing 
problems that are more complicated than the one considered in this section 
and not so easily solved by linear programming. Often we have a fixed collec- 
tion of objects and we want to pack as many of them as possible into a given 
container (or several containers). Such problems are encountered by confec- 
tioners when cutting cookies from a piece of dough, by tailors or clothing 
manufacturers when making as many trousers, say, as possible from a large
piece of cloth, and so on. Typically, these problems are computationally hard, 
but linear programming can sometimes help in devising heuristics or approx- 
imate algorithms. 


2.7 Cutting Paper Rolls 


Here we have another industrial problem, and the application of linear pro- 
gramming is quite nonobvious. Moreover, we will naturally encounter an in- 
tegrality constraint, which will bring us to the topic of the next chapter. 

A paper mill manufactures rolls of paper of a standard width of 3 meters.
But customers want to buy paper rolls of shorter width, and the mill has to 
cut such rolls from the 3 m rolls. One 3 m roll can be cut, for instance, into 
two rolls 93 cm wide, one roll of width 108 cm, and a rest of 6 cm (which 
goes to waste). 

Let us consider an order of 


• 97 rolls of width 135 cm,
• 610 rolls of width 108 cm,
• 395 rolls of width 93 cm, and
• 211 rolls of width 42 cm.

What is the smallest number of 3 m rolls that have to be cut in order to 
satisfy this order, and how should they be cut? 

In order to engage linear programming one has to be generous in intro- 
ducing variables. We write down all of the requested widths: 135 cm, 108 cm, 
93 cm, and 42 cm. Then we list all possibilities of cutting a 3 m paper roll 
into rolls of some of these widths (we need to consider only possibilities for 
which the wasted piece is shorter than 42 cm): 


P1: 2 × 135                  P7:  108 + 93 + 2 × 42
P2: 135 + 108 + 42           P8:  108 + 4 × 42
P3: 135 + 93 + 42            P9:  3 × 93
P4: 135 + 3 × 42             P10: 2 × 93 + 2 × 42
P5: 2 × 108 + 2 × 42         P11: 93 + 4 × 42
P6: 108 + 2 × 93             P12: 7 × 42


For each possibility Pj on the list we introduce a variable $x_j \ge 0$
representing the number of rolls cut according to that possibility. We want to
minimize the total number of rolls cut, i.e., $\sum_{j=1}^{12} x_j$, in such a
way that the customers are satisfied. For example, to satisfy the demand for
395 rolls of width 93 cm we require

\[ x_3 + 2x_6 + x_7 + 3x_9 + 2x_{10} + x_{11} \ge 395. \]

For each of the widths we obtain one constraint.


For a more complicated order, the list of possibilities would most 
likely be produced by computer. We would be in a quite typical situ- 
ation in which a linear program is not entered “by hand,” but rather 
is generated by some computer program. More-advanced techniques 
even generate the possibilities “on the fly,” during the solution of the 
linear program, which may save time and memory considerably. See 
the entry “column generation” in the glossary or Chvátal’s book cited
in Chapter 9, from which this example is taken. 
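
As a minimal sketch of such a generating program, the Python code below enumerates all ways of cutting one roll so that the wasted piece is shorter than the smallest requested width. For the order considered here it produces exactly the 12 possibilities listed above, although not in the order P1 through P12. The function name and the representation of a possibility as a tuple of counts are assumptions made for illustration.

```python
from itertools import product


def cutting_patterns(roll_width, widths):
    """All count vectors (one count per width) whose waste is below min(widths)."""
    smallest = min(widths)
    ranges = [range(roll_width // w + 1) for w in widths]
    patterns = []
    for counts in product(*ranges):
        used = sum(c * w for c, w in zip(counts, widths))
        waste = roll_width - used
        if 0 <= waste < smallest:
            patterns.append(counts)
    return patterns


widths = [135, 108, 93, 42]          # requested widths in cm
patterns = cutting_patterns(300, widths)
for counts in patterns:
    print(counts, "waste:", 300 - sum(c * w for c, w in zip(counts, widths)))
```

Plain enumeration is adequate for four widths; for larger orders the number of possibilities grows quickly, which is the motivation for the more advanced techniques mentioned above.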


The optimal solution of the resulting linear program has $x_1 = 48.5$,
$x_5 = 206.25$, $x_6 = 197.5$, and all other components 0. In order to cut
48.5 rolls according to the possibility P1, one has to unwind half of a roll.
Here we need more information about the technical possibilities of the paper
mill: Is cutting a fraction of a roll technically and economically feasible?
If yes, we have solved the problem optimally. If not, we have to work further
and somehow take into account the restriction that only feasible solutions of
the linear program with integral $x_j$ are of interest. This is not at all
easy in general, and it is the subject of Chapter 3.


3. Integer Programming and LP Relaxation 


3.1 Integer Programming 


In Section 2.7 we encountered a situation in which among all feasible so- 
lutions of a linear program, only those with all components integral are of 
interest in the practical application. A similar situation occurs quite often in 
attempts to apply linear programming, because objects that can be split into 
arbitrary fractions are more an exception than the rule. When hiring workers, 
scheduling buses, or cutting paper rolls one somehow has to deal with the 
fact that workers, buses, and paper rolls occur only in integral quantities. 

Sometimes an optimal or almost-optimal integral solution can be obtained
by simply rounding the components of an optimal solution of the linear
program to integers, either up, or down, or to the nearest integer. In our
paper-cutting example from Section 2.7 it is natural to round up, since we
have to fulfill the order. Starting from the optimal solution $x_1 = 48.5$,
$x_5 = 206.25$, $x_6 = 197.5$ of the linear program, we thus arrive at the
integral solution $x_1 = 49$, $x_5 = 207$, and $x_6 = 198$, which means
cutting 454 rolls. Since we have found an optimum of the linear program, we
know that no solution whatsoever, even one with fractional amounts of rolls
allowed, can do better than cutting 452.5 rolls. If we insist on cutting an
integral number of rolls, we can thus be sure that at least 453 rolls must be
cut. So the solution obtained by rounding is quite good.

However, it turns out that we can do slightly better. The integral solution
$x_1 = 49$, $x_5 = 207$, $x_6 = 196$, and $x_9 = 1$ (with all other components
0) requires cutting only 453 rolls. By the above considerations, no integral
solution can do better.

In general, the gap between a rounded solution and an optimal integral 
solution can be much larger. If the linear program specifies that for most of 
197 bus lines connecting villages it is best to schedule something between 
0.1 and 0.3 buses, then, clearly, rounding to integers exerts a truly radical 
influence. 

The problem of cutting paper rolls actually leads to a problem with a lin- 
ear objective function and linear constraints (equations and inequalities), but 
the variables are allowed to attain only integer values. Such an optimization 
problem is called an integer program, and after a small adjustment we can
write it in a way similar to that used for a linear program in Chapter 1:

An integer program:

\[
\begin{array}{ll}
\text{Maximize}   & c^T x \\
\text{subject to} & Ax \le b \\
                  & x \in \mathbb{Z}^n.
\end{array}
\]

Here A is an m×n matrix, $b \in \mathbb{R}^m$, $c \in \mathbb{R}^n$,
$\mathbb{Z}$ denotes the set of integers, and $\mathbb{Z}^n$ is the set of all
n-component integer vectors.

The set of all feasible solutions of an integer program is no longer a convex 
polyhedron, as was the case for linear programming, but it consists of separate 
integer points. A picture illustrates a two-dimensional integer program with 
five constraints: 


Feasible solutions are shown as solid dots and the optimal solution is marked 
by a circle. Note that it lies quite far from the optimum of the linear program 
with the same five constraints and the same objective function. 

It is known that solving a general integer program is computationally dif- 
ficult (more exactly, it is an NP-hard problem), in contrast to solving a linear 
program. Linear programs with many thousands of variables and constraints 
can be handled in practice, but there are integer programs with 10 variables 
and 10 constraints that are insurmountable even for the most modern com- 
puters and software. 
Adding the integrality constraints can thus change the difficulty
of a problem in a drastic way indeed. This may not look so surprising
anymore if we realize that integer programs can model yes/no
decisions, since an integer variable $x_j$ satisfying the linear constraints
$0 \le x_j \le 1$ has possible values only 0 (no) and 1 (yes). For those
familiar with the foundations of NP-completeness it is thus not hard
to model the problem of satisfiability of logical formulas by an integer
program. In Section 3.4 we will see how an integer program can express
the maximum size of an independent set in a given graph, which
is also one of the basic NP-hard problems.
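
To indicate how such a modeling of logical formulas can go, here is a small Python sketch of one standard encoding, which is an assumption here since the text does not spell it out: a clause such as "$x_1$ or not $x_2$ or $x_3$" becomes the linear constraint $x_1 + (1 - x_2) + x_3 \ge 1$ over 0/1 variables. The code checks by brute force that the constraint and the clause agree on all assignments. The representation of literals as signed integers is a hypothetical convention.

```python
from itertools import product


def clause_lhs(clause, x):
    """Left-hand side of the linear constraint encoding a clause.

    A literal +i stands for x_i and -i for its negation (indices from 1).
    The clause is encoded as: clause_lhs(clause, x) >= 1.
    """
    return sum(x[abs(lit) - 1] if lit > 0 else 1 - x[abs(lit) - 1]
               for lit in clause)


def clause_is_true(clause, x):
    """Direct logical evaluation of the clause under the 0/1 assignment x."""
    return any((x[abs(lit) - 1] == 1) == (lit > 0) for lit in clause)


clause = [1, -2, 3]      # x1 or (not x2) or x3
agree = all(
    (clause_lhs(clause, x) >= 1) == clause_is_true(clause, x)
    for x in product((0, 1), repeat=3)
)
print(agree)
```

A formula that is a conjunction of clauses is then satisfiable exactly if the integer program consisting of one such constraint per clause has a feasible solution.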


Several techniques have been developed for solving integer programs. In 
the literature, some of them can be found under the headings cutting planes, 
branch and bound, as well as branch and cut (see the glossary). The most 
successful strategies usually employ linear programming as a subroutine for 
solving certain auxiliary problems. How to do this efficiently is investigated 
in a branch of mathematics called polyhedral combinatorics. 

The most widespread use of linear programming today, and the one that 
consumes the largest share of computer time, is most likely in auxiliary com- 
putations for integer programs. 

Let us remark that there are many optimization problems in which some 
of the variables are integral, while others may attain arbitrary real values. 
Then one speaks of mixed integer programming. This is in all likelihood the 
most frequent type of optimization problem in practice. 

We will demonstrate several important optimization problems that can 
easily be formulated as integer programs, and we will show how linear pro- 
gramming can or cannot be used in their solution. But it will be only a small 
sample from this area, which has recently developed extensively and which 
uses many complicated techniques and clever tricks. 


3.2 Maximum-Weight Matching


A consulting company underwent a thorough reorganization, in order to 
adapt to current trends, in which the department of Creative Accounting 
with 7 employees was closed down. But flexibly enough, seven new positions 
have been created. The human resources manager, in order to assign the new 
positions to the seven employees, conducted interviews with them and gave 
them extensive questionnaires to fill out. Then he summarized the results in 
scores: Each employee got a score between 0 and 100 for each of the positions 
she or he was willing to accept. The manager depicted this information in a 
diagram, in which an expert can immediately recognize a bipartite graph: 
[Figure: bipartite graph with the positions (quality manager, representative
appearance, secretary, trend analyst, undersecretary, vacation specialist,
webmaster) on one side and the employees (Amos, Boris, Colette, Devdatt,
Eleanor, Fortunato, Gudrun) on the other; each edge is labeled with a score.]


For example, this diagram tells us that Boris is willing to accept the job in
quality management, for which he achieved a score of 87, or the job of a trend
analyst, for which he has a score of 70. Now the manager wants to select a
position for everyone so that the sum of scores is maximized. The first idea
naturally coming to mind is to give everyone the position for which he/she has
the largest score. But this cannot be done, since, for example, three people
are best suited for the profession of webmaster: Eleanor, Gudrun, and Devdatt.
If we try to assign the positions by a “greedy” algorithm, meaning that in
each step we make an assignment of largest possible score between a yet
unoccupied position and a still unassigned employee, we end up filling
only 6 positions:

(The bold digits 1-6 indicate the order of making assignments by the greedy
algorithm.)

In the language of graph theory we have a bipartite graph with vertex
set $V = X \cup Y$ and edge set E. Each edge connects a vertex of X to a
vertex of Y. Moreover, we have $|X| = |Y|$. For each edge $e \in E$ we are
given a nonnegative weight $w_e$. We want to find a subset $M \subseteq E$ of
edges such that each vertex of both X and Y is incident to exactly one edge of
M (such an M is called a perfect matching), and the sum $\sum_{e \in M} w_e$
is the largest possible.
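
The failure of the greedy strategy is easy to reproduce on a tiny instance. The following Python sketch uses a hypothetical bipartite graph with three employees and three positions (the scores from the diagram are not used) and compares the greedy assignment with a maximum-weight perfect matching found by trying all assignments. The exhaustive search is a stand-in suitable only for very small graphs; the names are illustrative.

```python
from itertools import permutations


def greedy_matching(weights):
    """Repeatedly take the heaviest edge between two still unmatched vertices."""
    chosen, used = [], set()
    for (u, v), w in sorted(weights.items(), key=lambda item: -item[1]):
        if u not in used and v not in used:
            chosen.append((u, v))
            used.update((u, v))
    return chosen


def best_perfect_matching(weights, X, Y):
    """Maximum-weight perfect matching by exhaustive search, or None."""
    best = None
    for perm in permutations(Y):
        edges = list(zip(X, perm))
        if all(e in weights for e in edges):
            total = sum(weights[e] for e in edges)
            if best is None or total > best[0]:
                best = (total, edges)
    return best


# Hypothetical scores: employees A, B, C and positions p, q, r.
X, Y = ["A", "B", "C"], ["p", "q", "r"]
weights = {("A", "p"): 90, ("A", "q"): 80, ("B", "p"): 85,
           ("C", "q"): 70, ("C", "r"): 10}

greedy = greedy_matching(weights)
optimum = best_perfect_matching(weights, X, Y)
print(greedy, sum(weights[e] for e in greedy))
print(optimum)
```

Here the greedy algorithm fills only two of the three positions, because its first choice takes away the only position acceptable to B, while a perfect matching does exist.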

In order to formulate this problem as an integer program, we introduce
variables $x_e$, one for each edge $e \in E$, that can attain values 0 or 1.
They will encode the sought-after set M: $x_e = 1$ means $e \in M$ and
$x_e = 0$ means $e \notin M$. Then $\sum_{e \in M} w_e$ can be written as

\[ \sum_{e \in E} w_e x_e, \]

and this is the objective function. The requirement that a vertex $v \in V$
have exactly one incident edge of M is expressed by having the sum of $x_e$
over all edges incident to v equal to 1. In symbols,
$\sum_{e \in E : v \in e} x_e = 1$. The resulting integer program is

\[
\begin{array}{ll}
\text{maximize}   & \sum_{e \in E} w_e x_e \\
\text{subject to} & \sum_{e \in E : v \in e} x_e = 1 \text{ for each vertex } v \in V, \text{ and} \\
                  & x_e \in \{0,1\} \text{ for each edge } e \in E.
\end{array}
\tag{3.1}
\]


If we leave out the integrality conditions, i.e., if we allow each $x_e$ to
attain all values in the interval [0,1], we obtain the following linear
program:

\[
\begin{array}{ll}
\text{Maximize}   & \sum_{e \in E} w_e x_e \\
\text{subject to} & \sum_{e \in E : v \in e} x_e = 1 \text{ for each vertex } v \in V, \text{ and} \\
                  & 0 \le x_e \le 1 \text{ for each edge } e \in E.
\end{array}
\]

It is called an LP relaxation of the integer program (3.1): we have relaxed
the constraints $x_e \in \{0,1\}$ to the weaker constraints
$0 \le x_e \le 1$. We can solve the LP relaxation, say by the simplex method,
and either we obtain an optimal solution $x^*$, or we learn that the LP
relaxation is infeasible. In the latter case, the original integer program
must be infeasible as well, and consequently, there is no perfect matching.

Let us now assume that the LP relaxation has an optimal solution x*. 
What can such an x* be good for? Certainly it provides an upper bound on the 
best possible solution of the original integer program (3.1). More precisely, the 
optimum of the objective function in the integer program (3.1) is bounded 
above by the value of the objective function at x*. This is because every 
feasible solution of the integer program is also a feasible solution of the LP 
relaxation, and so we are maximizing over a larger set of vectors in the LP
relaxation. An upper bound can be very valuable: For example, if we manage
to find a feasible solution of the integer program for which the value of the 
objective function is 98% of the upper bound, we can usually stop striving, 
since we know that we cannot improve by more than roughly 2% no matter 
how hard we try. (Of course, it depends on what we are dealing with; if it is 
a state budget, even 2% is still worth some effort.) 

A pleasant surprise awaits us in the particular problem we are considering 
here: The LP relaxation not only yields an upper bound, but it provides an 
optimal solution of the original problem! Namely, if we solve the LP relaxation 
by the simplex method, we obtain an optimal x* that has all components 
equal to 0 or 1, and thus it determines an optimum perfect matching. (If a 
better perfect matching existed, it would determine a better solution of the 
LP relaxation.) The optimal solution discovered in this way is drawn below:


[Figure: the optimal perfect matching; the vertices in the picture are labeled A, B, C, D, E, F, G.]


The following (nontrivial) theorem shows that things work this nicely for 
every problem of the considered type. 


3.2.1 Theorem. Let $G = (V,E)$ be an arbitrary bipartite graph with real
edge weights $w_e$. If the LP relaxation of the integer program (3.1) has at least
one feasible solution, then it has at least one integral optimal solution. This
is an optimal solution for the integer program (3.1) as well.


An interested reader can find a proof at the end of this section. 

The theorem doesn’t say that every optimal solution of the LP relaxation 
is necessarily integral. However, the proof gives an easy recipe for producing 
an integral optimal solution from an arbitrary optimal solution. Moreover, 
it can be shown (see Section 8.2) that the simplex method always returns 
an integral optimal solution (using the terminology of Chapter 4, each basic 
feasible solution is integral). 


An LP relaxation can be considered for an arbitrary integer program: The
condition $x \in \mathbb{Z}^n$ is simply replaced by $x \in \mathbb{R}^n$. Cases in which we always
get an optimal solution of the integer program from the LP relaxation, such 
as in Theorem 3.2.1, are rather rare, but LP relaxation can also be useful in 
other ways.

The maximum of the objective function for the LP relaxation always provides
an upper bound for the maximum of the integer program. Sometimes
this bound is quite tight, but at other times it can be very bad; see Section 3.4. 
The quality of the upper bound from an LP relaxation has been studied for 
many types of problems. Sometimes one adds new linear constraints to the 
LP relaxation that are satisfied by all integral solutions (that is, all feasible 
solutions of the integer program satisfy these new constraints) but that ex- 
clude some of the nonintegral solutions. In this way the upper bound on the 
optimum of the integer program can often be greatly improved. If we continue 
adding suitable new constraints for long enough, then we even arrive at an 
optimal solution of the integer program. This is the main idea of the method 
of cutting planes. 

A nonintegral optimal solution of the LP relaxation can sometimes be 
converted to an approximately optimal solution of the integer program by 
appropriate rounding. We will see a simple example in Section 3.3, and a 
more advanced one in Section 8.3. 


Proof of Theorem 3.2.1. Let $x^*$ be an optimal solution of the LP relaxation,
and let $w(x^*) = \sum_{e \in E} w_e x^*_e$ be the value of the objective function
at $x^*$. Let us denote the number of nonintegral components of the vector $x^*$
by $k(x^*)$.

If $k(x^*) = 0$, then we are done. For $k(x^*) > 0$ we describe a procedure that
yields another optimal solution $\tilde{x}$ with $k(\tilde{x}) < k(x^*)$. We reach an integral
optimal solution by finitely many repetitions of this procedure.

Let $x^*_{e_1}$ be a nonintegral component of the vector $x^*$, corresponding to
some edge $e_1 = \{a_1, b_1\}$. Since $0 < x^*_{e_1} < 1$ and

$$\sum_{e \in E:\, b_1 \in e} x^*_e = 1,$$

there exists another edge $e_2 = \{a_2, b_1\}$, $a_2 \ne a_1$, with $x^*_{e_2}$ nonintegral. For a
similar reason we can also find a third edge $e_3 = \{a_2, b_2\}$ with $0 < x^*_{e_3} < 1$.
We continue in this manner and look for nonintegral components along a
longer and longer path $(a_1, b_1, a_2, b_2, \ldots)$.


Since the graph $G$ has finitely many vertices, eventually we reach a vertex
that we have already visited before. This means that we have found a cycle
$C \subseteq E$ in which $0 < x^*_e < 1$ for all edges. Since the graph is bipartite, the
length $t$ of the cycle $C$ is even. For simplicity of notation let us assume that
the edges of $C$ are $e_1, e_2, \ldots, e_t$, although in reality the cycle found as above
need not begin with the edge $e_1$.
Now for a small real number $\varepsilon$ we define a vector $\tilde{x}$ by

$$\tilde{x}_e = \begin{cases}
x^*_e - \varepsilon & \text{for } e \in \{e_1, e_3, \ldots, e_{t-1}\} \\
x^*_e + \varepsilon & \text{for } e \in \{e_2, e_4, \ldots, e_t\} \\
x^*_e & \text{otherwise.}
\end{cases}$$

It is easy to see that this $\tilde{x}$ satisfies all the conditions

$$\sum_{e \in E:\, v \in e} \tilde{x}_e = 1, \quad v \in V,$$

since at the vertices of the cycle $C$ we have added $\varepsilon$ once and subtracted it
once, while for all other vertices the variables of the incident edges haven’t
changed their values at all. For $\varepsilon$ sufficiently small the conditions $0 \le \tilde{x}_e \le 1$
are satisfied too, since all components $x^*_{e_i}$ are strictly between 0 and 1. Hence
$\tilde{x}$ is again a feasible solution of the LP relaxation for all sufficiently small $\varepsilon$
(positive or negative).

What happens with the value of the objective function? We have

$$w(\tilde{x}) = \sum_{e \in E} w_e \tilde{x}_e = w(x^*) + \varepsilon \sum_{i=1}^{t} (-1)^i w_{e_i} = w(x^*) + \varepsilon \Delta,$$

where we have set $\Delta = \sum_{i=1}^{t} (-1)^i w_{e_i}$. Since $x^*$ is optimal, necessarily $\Delta = 0$,
for otherwise, we could achieve $w(\tilde{x}) > w(x^*)$ either by choosing $\varepsilon > 0$ (for
$\Delta > 0$) or by choosing $\varepsilon < 0$ (for $\Delta < 0$). This means that $\tilde{x}$ is an optimal
solution whenever it is feasible, i.e., for all $\varepsilon$ with a sufficiently small absolute
value.

Let us now choose the largest $\varepsilon > 0$ such that $\tilde{x}$ is still feasible. Then there
has to exist $e \in \{e_1, e_2, \ldots, e_t\}$ with $\tilde{x}_e \in \{0,1\}$, and $\tilde{x}$ has fewer nonintegral
components than $x^*$.
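The procedure from the proof can be traced on a small example. The Python sketch below is a minimal illustration under stated assumptions: edges are pairs `(a, b)` with `a` in $X$ and `b` in $Y$, the input `x` is a feasible solution of the LP relaxation given as exact fractions, and the function names are hypothetical. One small addition that is not in the proof: if $\Delta < 0$, the sketch flips the signs (this corresponds to choosing $\varepsilon < 0$), so the objective never decreases even when the input is not optimal.

```python
from fractions import Fraction

def find_fractional_cycle(edges, x):
    """Return indices of edges forming a cycle on which 0 < x_e < 1."""
    frac = [i for i, v in enumerate(x) if 0 < v < 1]
    incident = {}
    for i in frac:
        a, b = edges[i]
        incident.setdefault(("X", a), []).append(i)
        incident.setdefault(("Y", b), []).append(i)
    vertex = ("X", edges[frac[0]][0])
    path_vertices, path_edges, last = [vertex], [], None
    while True:
        # Every visited vertex has at least two fractional edges,
        # because its incident values sum to the integer 1.
        e = next(i for i in incident[vertex] if i != last)
        a, b = edges[e]
        vertex = ("Y", b) if vertex[0] == "X" else ("X", a)
        path_edges.append(e)
        if vertex in path_vertices:
            return path_edges[path_vertices.index(vertex):]
        path_vertices.append(vertex)
        last = e

def round_to_integral(edges, w, x):
    """Repeat the cycle step of the proof until x is integral."""
    x = list(x)
    while any(0 < v < 1 for v in x):
        cycle = find_fractional_cycle(edges, x)
        # -1 on e_1, e_3, ... and +1 on e_2, e_4, ..., as in the proof.
        signs = [(-1) ** (i + 1) for i in range(len(cycle))]
        delta = sum(s * w[e] for s, e in zip(signs, cycle))
        if delta < 0:                      # corresponds to choosing eps < 0
            signs = [-s for s in signs]
        # Largest eps > 0 that keeps 0 <= x_e <= 1 on the cycle.
        eps = min(x[e] if s < 0 else 1 - x[e] for s, e in zip(signs, cycle))
        for s, e in zip(signs, cycle):
            x[e] += s * eps
    return x

# Hypothetical instance: the complete bipartite graph on 2 + 2 vertices.
edges = [(0, 0), (0, 1), (1, 0), (1, 1)]
w = [2, 1, 3, 2]
x_star = [Fraction(1, 2)] * 4              # optimal here, objective value 4
x_int = round_to_integral(edges, w, x_star)
print(x_int)
```

Each pass through the loop makes at least one more component equal to 0 or 1, which mirrors the decrease of $k(x^*)$ in the proof.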


Let us now consider another situation, in which we have more
employees than positions and we want to fill all positions optimally,
i.e., so that the sum of scores is maximized. Then we have $|X| < |Y|$
in the considered bipartite graph, and for vertices $v \in Y$ the condition
$\sum_{e \in E:\, v \in e} x_e = 1$ (every vertex of $Y$ is incident to exactly one edge of
$M$) is replaced by $\sum_{e \in E:\, v \in e} x_e \le 1$ (every vertex of $Y$ is incident to
at most one edge of $M$). The claim of Theorem 3.2.1 remains valid: If
the LP relaxation has a feasible solution, then it also has an integral
optimal solution. The proof presented above can be modified to show
this. We present a different and more conceptual proof in Section 8.2
(which also yields an alternative proof of Theorem 3.2.1). We will also
briefly discuss the nonbipartite case in Section 8.2.


3.3 Minimum Vertex Cover 


The Internet had been expanding rapidly in the Free Republic of West Mor- 
dor, and the government issued a regulation, purely in the interest of im- 
proved security of the citizens, that every data link connecting two computers 
must be equipped with a special device for gathering statistical data about 
the traffic. An operator of a part of the network has to attach the govern- 
ment’s monitoring boxes to some of his computers so that each link has a 
monitored computer on at least one end. Which computers should get boxes 
so that the total price is minimum? Let us assume that there is a flat rate 
per box. 

It is again convenient to use graph-theoretic terminology. The computers 
in the network are vertices and the links are edges. So we have a graph 
$G = (V, E)$ and we want to find a subset $S \subseteq V$ of the vertices such that each
edge has at least one end-vertex in $S$ (such an $S$ is called a vertex cover),
and $S$ is as small as possible.

This problem can be written as an integer program:

$$\begin{array}{ll}
\text{minimize} & \sum_{v \in V} x_v \\
\text{subject to} & x_u + x_v \ge 1 \text{ for every edge } \{u,v\} \in E \\
& x_v \in \{0,1\} \text{ for all } v \in V.
\end{array} \tag{3.2}$$

For every vertex $v$ we have a variable $x_v$, which can attain values 0 or 1.
The meaning of $x_v = 1$ is $v \in S$, and $x_v = 0$ means $v \notin S$. The constraint
$x_u + x_v \ge 1$ guarantees that the edge $\{u,v\}$ has at least one vertex in $S$. The
objective function is the size of $S$.

It is known that finding a minimum vertex cover is a computationally 
difficult (NP-hard) problem. We will describe an approximation algorithm 
based on linear programming that always finds a vertex cover with at most 
twice as many vertices as in the smallest possible vertex cover. 

An LP relaxation of the above integer program is

$$\begin{array}{ll}
\text{minimize} & \sum_{v \in V} x_v \\
\text{subject to} & x_u + x_v \ge 1 \text{ for every edge } \{u,v\} \in E \\
& 0 \le x_v \le 1 \text{ for all } v \in V.
\end{array} \tag{3.3}$$

The first step of the approximation algorithm for vertex cover consists in
computing an optimal solution $x^*$ of this LP relaxation (by some standard
algorithm for linear programming). The components of $x^*$ are real numbers
in the interval $[0,1]$. In the second step we define the set

$$S_{LP} = \{v \in V : x^*_v \ge \tfrac{1}{2}\}.$$

This is a vertex cover, since for every edge $\{u,v\}$ we have $x^*_u + x^*_v \ge 1$, and
so $x^*_u \ge \tfrac{1}{2}$ or $x^*_v \ge \tfrac{1}{2}$.

Let $S_{OPT}$ be some vertex cover of the minimum possible size (we don’t
have it but we can theorize about it). We claim that

$$|S_{LP}| \le 2 \cdot |S_{OPT}|.$$

To see this, let $\tilde{x}$ be an optimal solution of the integer program (3.2),
which corresponds to the set $S_{OPT}$, i.e., $\tilde{x}_v = 1$ for $v \in S_{OPT}$ and $\tilde{x}_v = 0$
otherwise. This $\tilde{x}$ is definitely a feasible solution of the LP relaxation (3.3),
and so it cannot have a smaller value of the objective function than an optimal
solution $x^*$ of (3.3):

$$\sum_{v \in V} x^*_v \le \sum_{v \in V} \tilde{x}_v.$$

On the other hand, $|S_{LP}| = \sum_{v \in S_{LP}} 1 \le \sum_{v \in V} 2 x^*_v$, since $x^*_v \ge \tfrac{1}{2}$ for each
$v \in S_{LP}$. Therefore

$$|S_{LP}| \le 2 \cdot \sum_{v \in V} x^*_v \le 2 \cdot \sum_{v \in V} \tilde{x}_v = 2 \cdot |S_{OPT}|.$$


This proof illustrates an important aspect of approximation algorithms: 
In order to assess the quality of the computed solution, we always need a 
bound on the quality of the optimal solution, although we don’t know it. The 
LP relaxation provides such a bound, which can sometimes be useful, as in 
the example of this section. In other problems it may be useless, though, as 
we will see in the next section. 
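The rounding step and the factor-2 guarantee can be checked on a tiny graph. In the Python sketch below the LP optimum is supplied by hand rather than computed: for a triangle, adding the three edge constraints gives $2\sum_v x_v \ge 3$, so the all-$\tfrac{1}{2}$ vector with value $\tfrac{3}{2}$ is optimal for (3.3). The function names and the brute-force search for $S_{OPT}$ are assumptions of this illustration.

```python
from fractions import Fraction
from itertools import combinations

def round_lp_solution(x_star):
    """The set S_LP = {v : x*_v >= 1/2}."""
    return {v for v, value in x_star.items() if value >= Fraction(1, 2)}

def is_vertex_cover(edges, S):
    return all(u in S or v in S for u, v in edges)

def minimum_vertex_cover_size(vertices, edges):
    """Brute force over all subsets; feasible only for tiny graphs."""
    for k in range(len(vertices) + 1):
        for S in combinations(vertices, k):
            if is_vertex_cover(edges, set(S)):
                return k

# Hypothetical instance: a triangle.
vertices = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("a", "c")]
x_star = {v: Fraction(1, 2) for v in vertices}   # LP optimum, supplied by hand

S_lp = round_lp_solution(x_star)
opt = minimum_vertex_cover_size(vertices, edges)
print(sorted(S_lp), opt)
```

Here the rounding takes all three vertices while two suffice, so the computed cover is within the factor 2 but not optimal.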


Remarks. A natural attempt at an approximate solution of the con- 
sidered problem is again a greedy algorithm: Select vertices one by one 
and always take a vertex that covers the maximum possible number 
of yet uncovered edges. Although this algorithm may not be bad in 
most cases, examples can be constructed in which it yields a solution 
at least ten times worse, say, than an optimal solution (and 10 can be 
replaced by any other constant). Discovering such a construction is a 
lovely exercise. 

There is another, combinatorial, approximation algorithm for the 
minimum vertex cover: First we find a maximal matching M, that is, a 
matching that cannot be extended by adding any other edge (we note
that such a matching need not have the maximum possible number
of edges). Then we use the vertices covered by M as a vertex cover. 
This always gives a vertex cover at most twice as big as the optimum, 
similar to the algorithm explained above. 
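The combinatorial algorithm is short enough to sketch directly. The Python code below builds a maximal matching by scanning the edges once in the given order and returns its endpoints; the function names and the example path are assumptions of this illustration.

```python
def matching_vertex_cover(edges):
    """Greedily build a maximal matching M and return (M, endpoints of M)."""
    covered = set()
    matching = []
    for u, v in edges:
        # The edge can be added to M only if both endpoints are still free.
        if u not in covered and v not in covered:
            matching.append((u, v))
            covered.update((u, v))
    return matching, covered

def is_vertex_cover(edges, S):
    return all(u in S or v in S for u, v in edges)

# Hypothetical instance: the path a - b - c - d.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
matching, cover = matching_vertex_cover(edges)
print(matching, sorted(cover))
```

The result is a vertex cover because an uncovered edge could still be added to $M$, and it is at most twice the optimum because any vertex cover must contain at least one endpoint of each edge of $M$. On this path the cover has 4 vertices, while $\{b, c\}$ is optimal.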

The algorithm based on linear programming has the advantage of 
being easy to generalize for a weighted vertex cover (the govern- 
ment boxes may have different prices for different computers). In the 
same way as we did for unit prices one can show that the cost of the 
computed solution is never larger than twice the optimum cost. As in 
the unweighted case, this result can also be achieved with combina- 
torial algorithms, but these are more difficult to understand than the 
linear programming approach. 


3.4 Maximum Independent Set 


Here the authors got tired of inventing imitations of real-life problems, and 
so we formulate the next problem in the language of graph theory right away. 
For a graph $G = (V, E)$, a set $I \subseteq V$ of vertices is called independent (or
stable) if no two vertices of $I$ are connected by an edge in $G$.

Computing an independent set with the maximum possible number of 
vertices for a given graph is one of the notoriously difficult graph-theoretic 
problems. It can be easily expressed by an integer program: 


$$\begin{array}{ll}
\text{maximize} & \sum_{v \in V} x_v \\
\text{subject to} & x_u + x_v \le 1 \text{ for each edge } \{u,v\} \in E, \text{ and} \\
& x_v \in \{0,1\} \text{ for all } v \in V.
\end{array} \tag{3.4}$$

An optimal solution $x^*$ corresponds to a maximum independent set: $v \in I$ if
and only if $x^*_v = 1$. The constraints $x_u + x_v \le 1$ ensure that whenever two
vertices $u$ and $v$ are connected by an edge, at most one of them can get into $I$.

In an LP relaxation the condition $x_v \in \{0,1\}$ is replaced by the inequalities
$0 \le x_v \le 1$. The resulting linear program always has a feasible solution
with all $x_v = \tfrac{1}{2}$, which yields objective function equal to $|V|/2$. Hence the
optimal value of the objective function is $|V|/2$ or larger.

Let us consider a complete graph on n vertices (the graph in which every 
two vertices are connected by an edge). The largest independent set consists 
of a single vertex and thus has size 1. However, as we have seen, the optimal 
value for the LP relaxation is at least n/2. Hence, and this is the point of 
this section, the LP relaxation behaves in a way completely different from 
the original integer program. 
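The gap can be verified numerically for a small complete graph. The Python sketch below checks that the all-$\tfrac{1}{2}$ vector is feasible for the LP relaxation and compares its value with the true optimum found by brute force; the function names and the choice $n = 6$ are assumptions of this illustration.

```python
from fractions import Fraction
from itertools import combinations

def complete_graph_edges(n):
    """All pairs {u, v} on the vertex set {0, ..., n-1}."""
    return list(combinations(range(n), 2))

def lp_value_of_all_halves(n, edges):
    """Check feasibility of x_v = 1/2 for all v and return its objective value."""
    x = [Fraction(1, 2)] * n
    assert all(x[u] + x[v] <= 1 for u, v in edges)
    return sum(x)

def max_independent_set_size(n, edges):
    """Brute force over all vertex subsets, largest first."""
    for k in range(n, 0, -1):
        for I in combinations(range(n), k):
            chosen = set(I)
            if not any(u in chosen and v in chosen for u, v in edges):
                return k
    return 0

n = 6
edges = complete_graph_edges(n)
lp_bound = lp_value_of_all_halves(n, edges)
integer_opt = max_independent_set_size(n, edges)
print(lp_bound, integer_opt)
```

The LP value is a lower bound on the LP optimum, so the ratio between the LP optimum and the integer optimum is at least $n/2$ on the complete graph.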


The complete graph is by no means an isolated case. Dense graphs 
typically have a maximum independent set much smaller than half of
the vertices, and so for such graphs, too, an optimal solution of the
LP relaxation tells us almost nothing about the maximum independent 
set. 

It is even known that the size of a maximum independent set cannot
be approximated well by any reasonably efficient algorithm whatsoever
(provided that some widely believed but unproved assumptions
hold, such as P $\ne$ NP). This result is from


J. Håstad: Clique is hard to approximate within $n^{1-\varepsilon}$, Acta
Mathematica 182(1999) 105-142,


and


http://www.nada.kth.se/~viggo/problemlist/compendium.html


is a comprehensive website for inapproximability results. 


4. Theory of Linear Programming: 
First Steps 


4.1 Equational Form 


In the introductory chapter we explained how each linear program can be 
converted to the form 


maximize $c^T x$ subject to $Ax \le b$.


But the simplex method requires a different form, which is usually called the 
standard form in the literature. In this book we introduce a less common, 
but more descriptive term equational form. It looks like this: 


Equational form of a linear program:

$$\begin{array}{ll}
\text{maximize} & c^T x \\
\text{subject to} & Ax = b \\
& x \ge 0.
\end{array}$$

As usual, $x$ is the vector of variables, $A$ is a given $m \times n$ matrix, $c \in \mathbb{R}^n$,
$b \in \mathbb{R}^m$ are given vectors, and $0$ is the zero vector, in this case with $n$
components.

The constraints are thus partly equations, and partly inequalities of a
very special form $x_j \ge 0$, $j = 1, 2, \ldots, n$, called nonnegativity constraints.


(Warning: Although we call this form equational, it contains inequalities as 
well, and these must not be forgotten!) 

Let us emphasize that all variables in the equational form have to satisfy 
the nonnegativity constraints. 

In problems encountered in practice we often have nonnegativity con- 
straints automatically, since many quantities, such as the amount of con- 
sumed cucumber, cannot be negative. 


Transformation of an arbitrary linear program to equational form.
We illustrate such a transformation for the linear program

$$\begin{array}{ll}
\text{maximize} & 3x_1 - 2x_2 \\
\text{subject to} & 2x_1 - x_2 \le 4 \\
& x_1 + 3x_2 \ge 5 \\
& x_2 \ge 0.
\end{array}$$

We proceed as follows:


1. In order to convert the inequality $2x_1 - x_2 \le 4$ to an equation, we introduce
a new variable $x_3$, together with the nonnegativity constraint
$x_3 \ge 0$, and we replace the considered inequality by the equation
$2x_1 - x_2 + x_3 = 4$. The auxiliary variable $x_3$, which won’t appear anywhere
else in the transformed linear program, represents the difference
between the right-hand side and the left-hand side of the inequality. Such
an auxiliary variable is called a slack variable.

2. For the next inequality $x_1 + 3x_2 \ge 5$ we first multiply by $-1$, which
reverses the direction of the inequality. Then we introduce another slack
variable $x_4$ with the nonnegativity constraint $x_4 \ge 0$, and we replace the
inequality by the equation $-x_1 - 3x_2 + x_4 = -5$.

3. We are not finished yet: The variable $x_1$ in the original linear program
is allowed to attain both positive and negative values. We introduce two
new, nonnegative, variables $y_1$ and $z_1$, $y_1 \ge 0$, $z_1 \ge 0$, and we substitute
for $x_1$ the difference $y_1 - z_1$ everywhere. The variable $x_1$ itself disappears.


The resulting equational form of our linear program is 


$$\begin{array}{ll}
\text{maximize} & 3y_1 - 3z_1 - 2x_2 \\
\text{subject to} & 2y_1 - 2z_1 - x_2 + x_3 = 4 \\
& -y_1 + z_1 - 3x_2 + x_4 = -5 \\
& y_1 \ge 0,\ z_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0,\ x_4 \ge 0.
\end{array}$$

So as to comply with the conventions of the equational form in full, we should
now rename the variables to $x_1, x_2, \ldots, x_5$.

The presented procedure converts an arbitrary linear program with $n$ variables
and $m$ constraints into a linear program in equational form with at most
$m + 2n$ variables and $m$ equations (and, of course, nonnegativity constraints
for all variables).
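The transformation can be sanity-checked by mapping a feasible point of the original linear program to a point of the equational form. The Python sketch below does this for the worked example only; the function name `to_equational` and the test point $(x_1, x_2) = (-1, 3)$ are assumptions of this illustration, and the split of $x_1$ into $y_1 - z_1$ uses the positive and negative parts, which is one of many valid choices.

```python
def to_equational(x1, x2):
    """Map a feasible (x1, x2) of the example to (y1, z1, x2, x3, x4)."""
    y1, z1 = max(x1, 0), max(-x1, 0)   # x1 = y1 - z1 with y1, z1 >= 0
    x3 = 4 - (2 * x1 - x2)             # slack of 2*x1 - x2 <= 4
    x4 = (x1 + 3 * x2) - 5             # slack of x1 + 3*x2 >= 5
    return y1, z1, x2, x3, x4

x1, x2 = -1, 3                         # feasible: -5 <= 4, 8 >= 5, x2 >= 0
y1, z1, x2_new, x3, x4 = to_equational(x1, x2)

# Both equations of the equational form hold ...
eq1 = 2 * y1 - 2 * z1 - x2_new + x3
eq2 = -y1 + z1 - 3 * x2_new + x4
# ... and the objective value is unchanged.
old_objective = 3 * x1 - 2 * x2
new_objective = 3 * y1 - 3 * z1 - 2 * x2_new
print((y1, z1, x2_new, x3, x4), eq1, eq2, old_objective, new_objective)
```

Note that the point has a negative $x_1$, which is exactly the case that step 3 is needed for.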


Geometry of a linear program in equational form. Let us consider a
linear program in equational form:

$$\text{maximize } c^T x \text{ subject to } Ax = b,\ x \ge 0.$$

As is derived in linear algebra, the set of all solutions of the system $Ax = b$
is an affine subspace $F$ of the space $\mathbb{R}^n$. Hence the set of all feasible solutions
of the linear program is the intersection of $F$ with the nonnegative orthant,
which is the set of all points in $\mathbb{R}^n$ with all coordinates nonnegative.¹ The
following picture illustrates the geometry of feasible solutions for a linear
program with $n = 3$ variables and $m = 1$ equation, namely, the equation
$x_1 + x_2 + x_3 = 1$:


¹ In the plane ($n = 2$) this set is called the nonnegative quadrant, in $\mathbb{R}^3$ it is the
nonnegative octant, and the name orthant is used for an arbitrary dimension.


[Figure: in the coordinate system $x_1, x_2, x_3$, the set of all solutions of $Ax = b$ (a plane) and the set of all feasible solutions (a triangle).]


(In interesting cases we usually have more than 3 variables and no picture 
can be drawn.) 


A preliminary cleanup. Now we will be talking about solutions of the 
system Ax = b. By this we mean arbitrary real solutions, whose components 
may be positive, negative, or zero. So this is not the same as feasible solutions 
of the considered linear program, since a feasible solution has to satisfy Ax = 
b and have all components nonnegative. 

If we change the system Ax = b by some transformation that preserves 
the set of solutions, such as a row operation in Gaussian elimination, it influ- 
ences neither feasible solutions nor optimal solutions of the linear program. 
This will be amply used in the simplex method. 


Assumption: We will consider only linear programs in equational form
such that

• the system of equations $Ax = b$ has at least one solution, and
• the rows of the matrix $A$ are linearly independent.


As an explanation of this assumption we need to recall a few facts from 
linear algebra. Checking whether the system Ax = b has a solution is easy 
by Gaussian elimination, and if there is no solution, the considered linear 
program has no feasible solution either, and we can thus disregard it. 
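Both conditions of the assumption can be enforced by one pass of Gaussian elimination: an equation that reduces to $0 = 0$ is a linear combination of the earlier ones and can be dropped, while an equation that reduces to $0 = \beta$ with $\beta \ne 0$ shows that the system has no solution. The Python sketch below works over exact fractions; the function name `clean_up` and the example systems are assumptions of this illustration. It returns row-reduced equations rather than a subset of the original ones, which is harmless because row operations preserve the set of solutions.

```python
from fractions import Fraction

def clean_up(A, b):
    """Return (A', b') with linearly independent rows and the same solutions
    as A x = b, or None if the system has no solution."""
    n = len(A[0])
    rows = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(A, b)]
    kept = []                                  # pairs (pivot column, row)
    for row in rows:
        # Eliminate the pivot columns of the rows kept so far.
        for pivot_col, kept_row in kept:
            if row[pivot_col] != 0:
                f = row[pivot_col] / kept_row[pivot_col]
                row = [a - f * k for a, k in zip(row, kept_row)]
        pivot_col = next((j for j in range(n) if row[j] != 0), None)
        if pivot_col is None:
            if row[n] != 0:
                return None                    # 0 = nonzero: no solution
            continue                           # 0 = 0: redundant equation
        kept.append((pivot_col, row))
    return [r[:n] for _, r in kept], [r[n] for _, r in kept]

# Hypothetical systems: the second equation is twice the first one.
A = [[1, 1, 1], [2, 2, 2], [1, 0, -1]]
cleaned = clean_up(A, [1, 2, 0])
inconsistent = clean_up(A, [1, 3, 0])
print(cleaned, inconsistent)
```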

If the system Ax = b has a solution and if some row of A is a linear 
combination of the other rows, then the corresponding equation is redundant 
and it can be deleted from the system without changing the set of solutions. 
We may thus assume that the matrix A has m linearly independent rows and 
(therefore) rank m.


4.2 Basic Feasible Solutions


Among all feasible solutions of a linear program, a privileged status is granted 
to so-called basic feasible solutions. In this section we will consider them only 
for linear programs in equational form. Let us look again at the picture of 
the set of feasible solutions for a linear program with n = 3, m= 1: 

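[Figure: the triangle of feasible solutions in the coordinate system $x_1, x_2, x_3$, with three marked feasible solutions p, q, and r.]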


Among the feasible solutions p, q, and r only r is basic. Expressed geomet- 
rically and very informally, a basic feasible solution is a tip (corner, spike) of 
the set of feasible solutions. We will formulate this kind of geometric descrip- 
tion of a basic feasible solution later (see Theorem 4.4.1). 

The definition that we present next turns out to be equivalent, but it 
looks rather different. It requires that, very roughly speaking, a basic feasible 
solution have sufficiently many zero components. Before stating it we intro- 
duce a new piece of notation. 

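In this section $A$ is always a matrix with $m$ rows and $n$ columns ($n \ge m$),
of rank $m$. For a subset $B \subseteq \{1, 2, \ldots, n\}$ we let $A_B$ denote the matrix
consisting of the columns of $A$ whose indices belong to $B$. For instance, for

$$A = \begin{pmatrix} 1 & 5 & 3 & 4 & 6 \\ 0 & 1 & 3 & 5 & 6 \end{pmatrix} \text{ and } B = \{2, 4\} \text{ we have } A_B = \begin{pmatrix} 5 & 4 \\ 1 & 5 \end{pmatrix}.$$

We will use a similar notation for vectors; e.g., for $x = (3, 5, 7, 9, 11)$ and
$B = \{2, 4\}$ we have $x_B = (5, 9)$.
Now we are ready to state a formal definition.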
Now we are ready to state a formal definition. 


A basic feasible solution of the linear program

$$\text{maximize } c^T x \text{ subject to } Ax = b \text{ and } x \ge 0$$

is a feasible solution $x \in \mathbb{R}^n$ for which there exists an $m$-element set
$B \subseteq \{1, 2, \ldots, n\}$ such that

• the (square) matrix $A_B$ is nonsingular, i.e., the columns indexed by
$B$ are linearly independent, and
• $x_j = 0$ for all $j \notin B$.

For example, $x = (0, 2, 0, 1, 0)$ is a basic feasible solution for

$$A = \begin{pmatrix} 1 & 5 & 3 & 4 & 6 \\ 0 & 1 & 3 & 5 & 6 \end{pmatrix}, \quad b = (14, 7)$$

with $B = \{2, 4\}$.

If such a $B$ is fixed, we call the variables $x_j$ with $j \in B$ the basic variables,
while the remaining variables are called nonbasic. We can thus briefly
say that all nonbasic variables are zero in a basic feasible solution.

Let us note that the definition doesn’t consider the vector c at all, and so 
basic feasible solutions depend solely on A and b. 


For some considerations it is convenient to reformulate the definition of a 
basic feasible solution a little. 


4.2.1 Lemma. A feasible solution $x$ of a linear program in equational form
is basic if and only if the columns of the matrix $A_K$ are linearly independent,
where $K = \{j \in \{1, 2, \ldots, n\} : x_j > 0\}$.


Proof. One of the implications is obvious: If $x$ is a basic feasible solution
and $B$ is the corresponding $m$-element set as in the definition, then $K \subseteq B$
and thus the columns of the matrix $A_K$ are linearly independent.
Conversely, let $x$ be feasible and such that the columns of $A_K$ are linearly
independent. If $|K| = m$, then we can simply take $B = K$. Otherwise, for
$|K| < m$, we extend $K$ to an $m$-element set $B$ by adding $m - |K|$ more indices
so that the columns of $A_B$ are linearly independent. This is a standard fact
of linear algebra, which can be verified using the algorithm described next.
We initially set the current $B$ to $K$, and repeat the following step: If $A$
has a column that is not in the linear span of the columns of $A_B$, we add the
index of such a column to $B$. As soon as this step is no longer possible, that
is, all columns of $A$ are in the linear span of the columns of $A_B$, it is easily seen
that the columns of $A_B$ constitute a basis of the column space of $A$. Since
$A$ has rank $m$, we have $|B| = m$ as needed.


4.2.2 Proposition. A basic feasible solution is uniquely determined by the
set $B$. That is, for every $m$-element set $B \subseteq \{1, 2, \ldots, n\}$ with $A_B$ nonsingular
there exists at most one feasible solution $x \in \mathbb{R}^n$ with $x_j = 0$ for all $j \notin B$.


Let us stress right away that a single basic feasible solution may be ob- 
tained from many different sets B. 


Proof of Proposition 4.2.2. For $x$ to be feasible we must have $Ax = b$.
The left-hand side can be rewritten to $Ax = A_B x_B + A_N x_N$, where $N =
\{1, 2, \ldots, n\} \setminus B$. For $x$ to be a basic feasible solution, the vector $x_N$ of
nonbasic variables must equal $0$, and thus the vector $x_B$ of basic variables
satisfies $A_B x_B = b$. And here we use the fact that $A_B$ is a nonsingular square
matrix: The system $A_B x_B = b$ has exactly one solution $\tilde{x}_B$. If all components
of $\tilde{x}_B$ are nonnegative, then we have exactly one basic feasible solution for
the considered $B$ (we amend $\tilde{x}_B$ by zeros), and otherwise, we have none.


We introduce the following terminology: We call an $m$-element set $B \subseteq
\{1, 2, \ldots, n\}$ with $A_B$ nonsingular a basis.² If, moreover, $B$ determines a
basic feasible solution, or in other words, if the unique solution of the system
$A_B x_B = b$ is nonnegative, then we call $B$ a feasible basis.

The following theorem deals with the existence of optimal solutions, and 
moreover, it shows that it suffices to look for them solely among basic feasible 
solutions. 


4.2.3 Theorem. Let us consider a linear program in equational form

$$\text{maximize } c^T x \text{ subject to } Ax = b,\ x \ge 0.$$

(i) (“Optimal solutions may fail to exist only for obvious reasons.”) If there is
at least one feasible solution and the objective function is bounded from 
above on the set of all feasible solutions, then there exists an optimal 
solution. 

(ii) If an optimal solution exists, then there is a basic feasible solution that 
is optimal. 


A proof is not necessary for further reading and we defer it to the end 
of this section. The theorem also follows from the correctness of the simplex 
method, which will be discussed in the next chapter. 

The theorem just stated implies a finite, although entirely impractical,
algorithm for solving linear programs in equational form. We consider all
$m$-element subsets $B \subseteq \{1, 2, \ldots, n\}$ one by one, and for each of them we check
whether it is a feasible basis, by solving a system of linear equations (we
obtain at most one basic feasible solution for each $B$ by Proposition 4.2.2).
Then we calculate the maximum of the objective function over all basic
feasible solutions found in this way.
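The brute-force algorithm fits in a few lines of Python. The sketch below is an illustration under stated assumptions: it uses exact fractions, 0-based column indices, and the matrix $A$ and vector $b = (14, 7)$ of the example above, while the objective vector `c` of all ones is a hypothetical choice made only for this demonstration. The feasible set of this instance is bounded (all coefficients in the first equation are positive), so the caveat about unbounded objective functions does not arise here.

```python
from fractions import Fraction
from itertools import combinations

def solve_square(M, rhs):
    """Solve M z = rhs by Gauss-Jordan elimination; None if M is singular."""
    m = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(m):
        pivot = next((r for r in range(col, m) if aug[r][col] != 0), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(m):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * k for a, k in zip(aug[r], aug[col])]
    return [aug[r][m] for r in range(m)]

def best_basic_feasible_solution(A, b, c):
    """Try every m-element index set B and keep the best feasible basis."""
    m, n = len(A), len(A[0])
    best = None
    for B in combinations(range(n), m):
        A_B = [[A[i][j] for j in B] for i in range(m)]
        x_B = solve_square(A_B, b)
        if x_B is None or any(v < 0 for v in x_B):
            continue                      # B is not a feasible basis
        x = [Fraction(0)] * n             # nonbasic variables are zero
        for j, v in zip(B, x_B):
            x[j] = v
        value = sum(cj * xj for cj, xj in zip(c, x))
        if best is None or value > best[0]:
            best = (value, x, B)
    return best

A = [[1, 5, 3, 4, 6], [0, 1, 3, 5, 6]]
b = [14, 7]
c = [1, 1, 1, 1, 1]                       # hypothetical objective
value, x, B = best_basic_feasible_solution(A, b, c)
print(value, x, B)
```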

Strictly speaking, this algorithm doesn’t work if the objective function is 
unbounded. Formulating a variant of the algorithm that functions properly 
even in this case, i.e., it reports that the linear program is unbounded, we leave 
as an exercise. Soon we will discuss the considerably more efficient simplex 
method, and there we show in detail how to deal with unboundedness. 

We have to consider $\binom{n}{m}$ sets $B$ in the above algorithm.³ For example,
for $n = 2m$, the function $\binom{2m}{m}$ grows roughly like $4^m$, i.e., exponentially, and
this is too much even for moderately large $m$.
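A few concrete values show how quickly the number of candidate sets grows. The short Python sketch below uses `math.comb` from the standard library; the function name and the sample values of $m$ are assumptions of this illustration. The printed ratio to $4^m$ shrinks only slowly, in line with the standard estimate $\binom{2m}{m} \approx 4^m / \sqrt{\pi m}$.

```python
from math import comb

def count_candidate_bases(n, m):
    """Number of m-element subsets B of {1, ..., n} the algorithm examines."""
    return comb(n, m)

for m in (5, 10, 20, 40):
    count = count_candidate_bases(2 * m, m)
    print(m, count, count / 4 ** m)
```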


² This is a shortcut. The index set $B$ itself is not a basis in the sense of linear
algebra, of course. Rather the set of columns of the matrix $A_B$ constitutes a
basis of the column space of $A$.

³ We recall that the binomial coefficient $\binom{n}{m} = \frac{n!}{m!(n-m)!}$ counts the number of
$m$-element subsets of an $n$-element set.

As we will see in Chapter 5, the simplex method also goes through basic
feasible solutions, but in a more clever way. It walks from one to another while 
improving the value of the objective function all the time, until it reaches an 
optimal solution. 

Let us summarize the main findings of this section. 


A linear program in equational form has finitely many basic feasible 
solutions, and if it is feasible and bounded, then at least one of the basic 
feasible solutions is optimal. 


Consequently, any linear program that is feasible and bounded has an 
optimal solution. 


Proof of Theorem 4.2.3. We will use some steps that will reappear in the 
simplex method in a more elaborate form, and so the present proof is a kind 
of preparation. We prove the following statement: 


If the objective function of a linear program in equational form is
bounded above, then for every feasible solution $x_0$ there exists a
basic feasible solution $\tilde{x}$ with the same or larger value of the objective
function; i.e., $c^T \tilde{x} \ge c^T x_0$.


How does this imply the theorem? If the linear program is feasible and 
bounded, then according to the statement, for every feasible solution there 
is a basic feasible solution with the same or larger objective function. Since 
there are only finitely many basic feasible solutions, some of them have to 
give the maximum value of the objective function, which means that they 
are optimal. We thus get both (i) and (ii) at once. 


In order to prove the statement, let us consider an arbitrary feasible solution
$x_0$. Among all feasible solutions $x$ with $c^T x \ge c^T x_0$ we choose one that
has the largest possible number of zero components, and we call it $\tilde{x}$ (it need
not be determined uniquely). We define an index set

$$K = \{j \in \{1, 2, \ldots, n\} : \tilde{x}_j > 0\}.$$

If the columns of the matrix $A_K$ are linearly independent, then $\tilde{x}$ is a basic
feasible solution as in the statement, by Lemma 4.2.1, and we are done.

So let us suppose that the columns of $A_K$ are linearly dependent, which
means that there is a nonzero $|K|$-component vector $v$ such that $A_K v = 0$.
We extend $v$ by zeros in positions outside $K$ to an $n$-component vector $w$
(so $w_K = v$ and $Aw = A_K v = 0$).

Let us assume for a moment that $w$ satisfies the following two conditions
(we will show later why we can assume this):


(i) $c^T w \ge 0$.
(ii) There exists $j \in K$ with $w_j < 0$.

For a real number $t \ge 0$ let us consider the vector $x(t) = \tilde{x} + tw$. We show
that for some suitable $t_1 > 0$ the vector $x(t_1)$ is a feasible solution with
more zero components than $\tilde{x}$. At the same time, $c^T x(t_1) = c^T \tilde{x} + t_1 c^T w \ge
c^T x_0 + t_1 c^T w \ge c^T x_0$, and so we get a contradiction to the assumption that
$\tilde{x}$ has the largest possible number of zero components.

We have $Ax(t) = b$ for all $t$ since $Ax(t) = A\tilde{x} + tAw = A\tilde{x} = b$, because
$\tilde{x}$ is feasible. Moreover, for $t = 0$ the vector $x(0) = \tilde{x}$ has all components from
$K$ strictly positive and all other components zero. For the $j$th component of
$x(t)$ we have $x(t)_j = \tilde{x}_j + t w_j$, and if $w_j < 0$ as in condition (ii), we get
$x(t)_j < 0$ for all sufficiently large $t > 0$. If we begin with $t = 0$ and let $t$ grow,
then those $x(t)_j$ with $w_j < 0$ are decreasing, and at a certain moment $t_1$ the
first of these decreasing components reaches 0. Explicitly, $t_1 = \min\{-\tilde{x}_j / w_j : w_j < 0\}$,
which is positive because $w_j < 0$ is possible only for $j \in K$, where $\tilde{x}_j > 0$. At this moment, obviously,
$x(t_1)$ still has all components nonnegative, and thus it is feasible, but it has
at least one extra zero component compared to $\tilde{x}$. This, as we have already
noted, is a contradiction.

Now what do we do if the vector $w$ fails to satisfy condition (i) or (ii)?
If $c^T w = 0$, then (i) holds and (ii) can be recovered by changing the sign
of $w$ (since $w \ne 0$). So we assume $c^T w \ne 0$, and again after a possible
sign change we can achieve $c^T w > 0$ and thus (i). Now if (ii) fails, we must
have $w \ge 0$. But this means that $x(t) = \tilde{x} + tw \ge 0$ for all $t \ge 0$, and
hence all such $x(t)$ are feasible. The value of the objective function for $x(t)$
is $c^T x(t) = c^T \tilde{x} + t c^T w$, and it tends to infinity as $t \to \infty$. Hence the linear
program is unbounded. This concludes the proof.
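The key step of the proof, moving along $w$ until the first positive component reaches zero, can be sketched in Python. The function name `step_to_more_zeros` and the example are assumptions of this illustration: the feasible region is the triangle $x_1 + x_2 + x_3 = 1$, $x \ge 0$, with the hypothetical objective $c = (1, 2, 0)$, the starting point $\tilde{x} = (\tfrac{1}{2}, \tfrac{1}{2}, 0)$, and the direction $w = (-1, 1, 0)$, which satisfies $Aw = 0$, $c^T w = 1 \ge 0$, and $w_1 < 0$.

```python
from fractions import Fraction

def step_to_more_zeros(x, w):
    """Return (t1, x(t1)) for x(t) = x + t*w, where t1 is the first moment at
    which a component with w_j < 0 reaches zero.

    Assumes x >= 0, A w = 0, w_j = 0 wherever x_j = 0, and some w_j < 0.
    """
    t1 = min(-Fraction(xj) / wj for xj, wj in zip(x, w) if wj < 0)
    return t1, [xj + t1 * wj for xj, wj in zip(x, w)]

c = [1, 2, 0]
x_tilde = [Fraction(1, 2), Fraction(1, 2), Fraction(0)]
w = [-1, 1, 0]

t1, x_new = step_to_more_zeros(x_tilde, w)
old_value = sum(cj * xj for cj, xj in zip(c, x_tilde))
new_value = sum(cj * xj for cj, xj in zip(c, x_new))
print(t1, x_new, old_value, new_value)
```

The new point has one more zero component and an objective value that is at least as large, which is exactly the contradiction used in the proof.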


4.3 ABC of Convexity and Convex Polyhedra 


Convexity is one of the fundamental notions in all mathematics, and in the 
theory of linear programming it is encountered very naturally. Here we recall 
the definition and present some of the most basic notions and results, which, 
at the very least, help in gaining a better intuition about linear programming. 

On the other hand, linear programming can be presented without these 
notions, and in concise courses there is usually no time for such material. 
Accordingly, this section and the next are meant as extending material, and 
the rest of the book should mostly be accessible without them. 


A set $X \subseteq \mathbb{R}^n$ is convex if for every two points $x, y \in X$ it also contains
the segment $xy$. Expressed differently, for every $x, y \in X$ and every
$t \in [0,1]$ we have $tx + (1-t)y \in X$.


A word of explanation might be in order: $tx + (1-t)y$ is the point on the
segment $xy$ at distance $t$ from $y$ and distance $1-t$ from $x$, if we take the
length of the segment as unit distance.

Here are a few examples of convex and nonconvex sets in the plane:

[Figure: several planar sets, labeled “nonconvex” and “convex.”]


The convex set at the bottom right in this picture, a stadium, is worth re- 
membering, since often it is a counterexample to statements about convex 
sets that may look obvious at first sight but are false. 

In calculus one works mainly with convex functions. Both notions, convex
sets and convex functions, are closely related: For instance, a real function
$f\colon \mathbb{R} \to \mathbb{R}$ is convex if and only if its epigraph, i.e., the set
$\{(x,y) \in \mathbb{R}^2 : y \ge f(x)\}$, is a convex set in the plane. In general, a function
$f\colon X \to \mathbb{R}$ is called convex if for every $x, y \in X$ and every $t \in [0,1]$ we have


$$f(tx + (1-t)y) \le t f(x) + (1-t) f(y).$$


The function is called strictly convex if the inequality is strict for all $x \ne y$
and all $t \in (0,1)$.


Convex hull and convex combinations. It is easily seen that the inter- 
section of an arbitrary collection of convex sets is again a convex set. This 
allows us to define the convex hull. 

Let $X \subseteq \mathbb{R}^n$ be a set. The convex hull of $X$ is the intersection of all
convex sets that contain X. Thus it is the smallest convex set containing X, 
in the sense that any convex set containing X also contains its convex hull. 


[Figure: a set X and the convex hull of X.]


This is not a very constructive definition. The convex hull can also be 
described using convex combinations, in a way similar to the description of 
the linear span of a set of vectors using linear combinations. Let $x_1, x_2, \ldots, x_m$
be points in $\mathbb{R}^n$. Every point of the form


$$t_1 x_1 + t_2 x_2 + \cdots + t_m x_m, \quad\text{where } t_1, t_2, \ldots, t_m \ge 0 \text{ and } \sum_{i=1}^{m} t_i = 1,$$


is called a convex combination of $x_1, x_2, \ldots, x_m$. A convex combination is
thus a particular kind of a linear combination, in which the coefficients are
nonnegative and sum to 1.

Convex combinations of two points $x$ and $y$ are of the form $tx + (1-t)y$,
$t \in [0,1]$, and as we said after the definition of a convex set, they fill exactly the
segment $xy$. It is easy but instructive to show that all convex combinations
of three points $x, y, z$ fill exactly the triangle $xyz$ (unless the points are
collinear, that is).


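4.3.1 Lemma. The convex hull $C$ of a set $X \subseteq \mathbb{R}^n$ equals the set


$$\tilde{C} = \Bigl\{ \sum_{i=1}^{m} t_i x_i : m \ge 1,\ x_1, \ldots, x_m \in X,\ t_1, \ldots, t_m \ge 0,\ \sum_{i=1}^{m} t_i = 1 \Bigr\}$$


of all convex combinations of finitely many points of $X$.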


Proof. First we prove by induction on $m$ that each convex combination has
to lie in the convex hull $C$. For $m = 1$ it is obvious and for $m = 2$ it follows
directly from the convexity of $C$.

Let $m \ge 3$ and let $x = t_1 x_1 + \cdots + t_m x_m$ be a convex combination of
points of $X$. If $t_m = 1$, then we have $x = x_m \in C$. For $t_m < 1$ let us put
$t_i' = t_i/(1-t_m)$, $i = 1, 2, \ldots, m-1$. Then $x' = t_1' x_1 + \cdots + t_{m-1}' x_{m-1}$ is
a convex combination of the points $x_1, \ldots, x_{m-1}$ (the $t_i'$ sum to 1), and by
the inductive hypothesis $x' \in C$. So $x = (1-t_m)x' + t_m x_m$ is a convex
combination of two points of the (convex) set $C$ and as such it also lies in $C$.

We have thus proved $\tilde{C} \subseteq C$. For the reverse inclusion it suffices to prove
that $\tilde{C}$ is convex, that is, to verify that whenever $x, y \in \tilde{C}$ are two convex
combinations and $t \in (0,1)$, then $tx + (1-t)y$ is again a convex combination.
This is straightforward and we take the liberty of omitting further details.
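
The inductive step of the proof can be traced on a small numerical example. The following Python sketch (the function names are illustrative, not taken from the text) evaluates a convex combination of three points using only two-point combinations, exactly as in the reduction $x = (1-t_m)x' + t_m x_m$, and compares the result with the direct weighted sum. Exact rational arithmetic is used so that the comparison is not disturbed by rounding.

```python
from fractions import Fraction


def combine_two(t, x, y):
    """Return t*x + (1-t)*y for points given as tuples."""
    return tuple(t * a + (1 - t) * b for a, b in zip(x, y))


def nested_combination(points, weights):
    """Evaluate sum(t_i * x_i) using only two-point convex combinations,
    following the inductive step of the proof of Lemma 4.3.1."""
    if len(points) == 1:
        return points[0]
    t_m = weights[-1]
    if t_m == 1:
        return points[-1]
    # Rescaled weights t'_i = t_i / (1 - t_m) again sum to 1.
    rescaled = [t / (1 - t_m) for t in weights[:-1]]
    x_prime = nested_combination(points[:-1], rescaled)
    # x = (1 - t_m) * x' + t_m * x_m
    return combine_two(1 - t_m, x_prime, points[-1])


# Three points in the plane and weights that are nonnegative and sum to 1.
points = [(Fraction(0), Fraction(0)),
          (Fraction(4), Fraction(0)),
          (Fraction(0), Fraction(4))]
weights = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

direct = tuple(sum(t * p[k] for t, p in zip(weights, points)) for k in range(2))
nested = nested_combination(points, weights)
print(direct, nested)
```

Both computations give the same point of the triangle spanned by the three points.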


Convex sets encountered in the theory of linear programming are of a
special type and they are called convex polyhedra.

Hyperplanes, half-spaces, polyhedra. We recall that a hyperplane
in $\mathbb{R}^n$ is an affine subspace of dimension $n-1$. In other words, it is the set of
all solutions of a single linear equation of the form


$$a_1 x_1 + a_2 x_2 + \cdots + a_n x_n = b,$$


where $a_1, a_2, \ldots, a_n$ are not all 0. Hyperplanes in $\mathbb{R}^2$ are lines and hyperplanes
in $\mathbb{R}^3$ are ordinary planes.

A hyperplane divides $\mathbb{R}^n$ into two half-spaces and it constitutes their
common boundary. For the hyperplane with equation
$a_1 x_1 + a_2 x_2 + \cdots + a_n x_n = b$, the two half-spaces have the following
analytic expression:


$$\{x \in \mathbb{R}^n : a_1 x_1 + a_2 x_2 + \cdots + a_n x_n \le b\}$$


and


$$\{x \in \mathbb{R}^n : a_1 x_1 + a_2 x_2 + \cdots + a_n x_n \ge b\}.$$


More exactly, these are closed half-spaces that contain their boundary.

A convex polyhedron is an intersection of finitely many closed half-spaces
in $\mathbb{R}^n$.


A half-space is obviously convex, and hence an intersection of half-spaces 
is convex as well. Thus convex polyhedra bear the attribute convex by right. 

A disk in the plane is a convex set, but it is not a convex polyhedron 
(because, roughly speaking, a convex polyhedron has to be “edgy”... but 
try proving this formally). 

A half-space is the set of all solutions of a single linear inequality (with 
at least one nonzero coefficient of some variable $x_j$). The set of all solutions
of a system of finitely many linear inequalities, a.k.a. the set of all feasible 
solutions of a linear program, is geometrically the intersection of finitely many 
half-spaces, alias a convex polyhedron. (We should perhaps also mention that 
a hyperplane is the intersection of two half-spaces, and so the constraints can 
be both inequalities and equations.) 

Let us note that a convex polyhedron can be unbounded, since, for ex- 
ample, a single half-space is also a convex polyhedron. A bounded convex 
polyhedron, i.e. one that can be placed inside some large enough ball, is 
called a convex polytope. 

The dimension of a convex polyhedron $P \subseteq \mathbb{R}^n$ is the smallest dimension
of an affine subspace containing $P$. Equivalently, it is the largest $d$ for which
$P$ contains points $x_0, x_1, \ldots, x_d$ such that the $d$-tuple of vectors
$(x_1 - x_0, x_2 - x_0, \ldots, x_d - x_0)$ is linearly independent.

The empty set is also a convex polyhedron, and its dimension is usually 
defined as $-1$.

All convex polygons in the plane are two-dimensional convex polyhe- 
dra. Several types of three-dimensional convex polyhedra are taught at high 
schools and decorate mathematical cabinets, such as cubes, boxes, pyramids, 
or even regular dodecahedra, which can also be met as desktop calendars. 
Simple examples of convex polyhedra of an arbitrary dimension n are: 


• The $n$-dimensional cube $[-1,1]^n$, which can be written as the intersection
  of $2n$ half-spaces (which ones?).

• The $n$-dimensional crosspolytope $\{x \in \mathbb{R}^n : |x_1| + |x_2| + \cdots + |x_n| \le 1\}$:

  [Figure: the crosspolytope for n = 1, n = 2, and n = 3.]

  For $n = 3$ we get the regular octahedron. For expressing the $n$-dimensional
  crosspolytope as an intersection of half-spaces we need $2^n$ half-spaces (can
  you find them?).

• The regular $n$-dimensional simplex

  [Figure: the regular simplex for n = 1, n = 2, and n = 3.]

  can be defined in a quite simple and nice way as a subset of $\mathbb{R}^{n+1}$:

  $$\{x \in \mathbb{R}^{n+1} : x_1, x_2, \ldots, x_{n+1} \ge 0,\ x_1 + x_2 + \cdots + x_{n+1} = 1\}.$$

  We note that this is exactly the set of all feasible solutions of the linear
  program with the single equation $x_1 + x_2 + \cdots + x_{n+1} = 1$ and
  nonnegativity constraints;* see the picture in Section 4.1. In general, any
  $n$-dimensional convex polytope bounded by $n+1$ hyperplanes is called a
  simplex.
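
Readers who want to check their answers to the two questions about the cube and the crosspolytope may find the following Python sketch useful. It is one possible answer, written under the assumption that every half-space is stored as a pair (a, b) meaning $a^T x \le b$; the function names are illustrative. The cube uses the $2n$ vectors $\pm e_i$, the crosspolytope uses all $2^n$ sign vectors, and both descriptions are compared with the defining formulas on a small grid of test points.

```python
from itertools import product


def cube_halfspaces(n):
    """The 2n half-spaces a.x <= 1 with a = +e_i or a = -e_i."""
    rows = []
    for i in range(n):
        for sign in (1, -1):
            a = [0] * n
            a[i] = sign
            rows.append((tuple(a), 1))
    return rows


def crosspolytope_halfspaces(n):
    """The 2^n half-spaces a.x <= 1, one for every sign vector a in {-1, 1}^n."""
    return [(a, 1) for a in product((1, -1), repeat=n)]


def satisfies(halfspaces, x):
    """True if x lies in every half-space of the list."""
    return all(sum(ai * xi for ai, xi in zip(a, x)) <= b for a, b in halfspaces)


n = 3
cube = cube_halfspaces(n)
cross = crosspolytope_halfspaces(n)

# Compare with the direct definitions on a grid (multiples of 0.5 are exact floats).
grid = [-1.5, -1, -0.5, 0, 0.5, 1, 1.5]
agree = all(
    satisfies(cube, x) == (max(abs(v) for v in x) <= 1)
    and satisfies(cross, x) == (sum(abs(v) for v in x) <= 1)
    for x in product(grid, repeat=n)
)
print(len(cube), len(cross), agree)
```

The count of half-spaces grows linearly for the cube and exponentially for the crosspolytope, which is the point of the comparison in the text.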


Many interesting examples of convex polyhedra are obtained as sets of feasible
solutions of natural linear programs. For example, the LP relaxation of the
problem of maximum-weight matching (Section 3.2) for a complete bipartite
graph leads to the Birkhoff polytope. Geometric properties of such polyhedra
are often related to properties of combinatorial objects and to solutions of
combinatorial optimization problems in an interesting way. A nice book about
convex polyhedra is


G. M. Ziegler: Lectures on Polytopes, Springer-Verlag, Heidelberg,
1994 (corrected 2nd edition 1998).


The book


B. Grünbaum: Convex Polytopes, second edition prepared by Volker
Kaibel, Victor Klee, and Günter Ziegler, Springer-Verlag, Heidelberg,
2003


is a new edition of a 1967 classic, with extensive updates on the material
covered in the original book.


* On the other hand, the set of feasible solutions of a linear program in equational
form certainly isn’t always a simplex! The simplex method is so named for a
rather complicated reason, related to an alternative geometric view of a linear
program in equational form, different from the one discussed in this book.
According to this view, the $m$-tuple of numbers in the $j$th column of the matrix $A$
together with the number $c_j$ is interpreted as a point in $\mathbb{R}^{m+1}$. Then the simplex
method can be interpreted as a walk through certain simplices with vertices at
these points. It was this view that gave Dantzig faith in the simplex method and
convinced him that it made sense to study it.


4.4 Vertices and Basic Feasible Solutions 


A vertex of a convex polyhedron can be thought of as a “tip” or “spike.” For 
instance, a three-dimensional cube has 8 vertices, and a regular octahedron 
has 6 vertices. 

Mathematically, a vertex is defined as a point where some linear function 
attains a unique maximum. Thus a point $v$ is called a vertex of a convex
polyhedron $P \subseteq \mathbb{R}^n$ if $v \in P$ and there exists a nonzero vector $c \in \mathbb{R}^n$
such that $c^T v > c^T y$ for all $y \in P \setminus \{v\}$. Geometrically it means that the
hyperplane $\{x \in \mathbb{R}^n : c^T x = c^T v\}$ touches the polyhedron $P$ exactly at $v$.


Three-dimensional polyhedra have not only vertices, but also edges 
and faces. A general polyhedron $P \subseteq \mathbb{R}^n$ of dimension $n$ can have
vertices, edges, 2-dimensional faces, 3-dimensional faces, up to
$(n-1)$-dimensional faces. They are defined as follows: A subset $F \subseteq P$ is a
$k$-dimensional face of a convex polyhedron $P$ if $F$ has dimension $k$
and there exist a nonzero vector $c \in \mathbb{R}^n$ and a number $z \in \mathbb{R}$ such that
$c^T x = z$ for all $x \in F$ and $c^T x < z$ for all $x \in P \setminus F$. In other words,
there exists a hyperplane that touches $P$ exactly at $F$. Since such an
$F$ is the intersection of a hyperplane with a convex polyhedron, it is
a convex polyhedron itself, and its dimension is thus well defined. An
edge is a 1-dimensional face and a vertex is a 0-dimensional face.


Now we prove that vertices of a convex polyhedron and basic feasible 
solutions of a linear program are the same concept. 


4.4.1 Theorem. Let P be the set of all feasible solutions of a linear program 
in equational form (so P is a convex polyhedron). Then the following two 
conditions for a point v € P are equivalent: 


(i) v is a vertex of the polyhedron P. 
(ii) v is a basic feasible solution of the linear program. 


Proof. The implication (i) $\Rightarrow$ (ii) follows immediately from Theorem 4.2.3
(with $c$ being the vector defining $v$). It remains to prove (ii) $\Rightarrow$ (i).

Let us consider a basic feasible solution $v$ with a feasible basis $B$, and let
us define a vector $\tilde{c} \in \mathbb{R}^n$ by $\tilde{c}_j = 0$ for $j \in B$ and $\tilde{c}_j = -1$ otherwise. We have
$\tilde{c}^T v = 0$, and $\tilde{c}^T x \le 0$ for any $x \ge 0$, and hence $v$ maximizes the objective
function $\tilde{c}^T x$. Moreover, $\tilde{c}^T x < 0$ whenever $x \ge 0$ has a nonzero component
outside $B$. But by Proposition 4.2.2, $v$ is the only feasible solution with all
nonzero components in $B$, and therefore $v$ is the only point of $P$ maximizing
$\tilde{c}^T x$.


Basic feasible solutions for arbitrary linear programs. A sim- 
ilar theorem is valid for an arbitrary linear program, not only for one 
in equational form. We will not prove it here, but we at least say what 
a basic feasible solution is for a general linear program: 


4.4.2 Definition. A basic feasible solution of a linear program 
with n variables is a feasible solution for which some n linearly inde- 
pendent constraints hold with equality. 


A constraint that is an equation always has to be satisfied with 
equality, while an inequality constraint may be satisfied either with 
equality or with a strict inequality. The nonnegativity constraints 
satisfied with equality are also counted. The linear independence of 
constraints means that the vectors of the coefficients of the vari- 
ables are linearly independent. For example, for n = 4, the constraint 
$3x_1 + 5x_3 - 7x_4 \le 10$ has the corresponding vector $(3, 0, 5, -7)$.

As is known from linear algebra, a system of n linearly independent 
linear equations in n variables has exactly one solution. Hence, if x is 
a basic feasible solution and it satisfies some $n$ linearly independent
constraints with equality, then it is the only point in $\mathbb{R}^n$ that satisfies
these $n$ constraints with equality. Geometrically speaking, the constraints
satisfied with equality determine hyperplanes, $x$ lies on some
$n$ of them, and these $n$ hyperplanes meet in a single point.

The definition of a basic feasible solution for the equational form 
looks quite different, but in fact, it is a special case of the new defini- 
tion, as we now indicate. For a linear program in equational form we 
have m linearly independent equations always satisfied with equality, 
and so it remains to satisfy with equality some $n - m$ of the nonnegativity
constraints, and these must be linearly independent with
the equations. The coefficient vector of the nonnegativity constraint
$x_j \ge 0$ is $e_j$, with 1 at position $j$ and with zeros elsewhere. If $x$ is a basic
feasible solution according to the new definition, then there exists
a set $N \subseteq \{1, 2, \ldots, n\}$ of size $n - m$ such that $x_j = 0$ for all $j \in N$ and
the rows of the matrix $A$ together with the vectors $(e_j : j \in N)$ constitute
a linearly independent collection. This happens exactly if the
matrix $A_B$ has linearly independent rows, where $B = \{1, 2, \ldots, n\} \setminus N$,
and we are back at the definition of a basic feasible solution for the
equational form.

For a general linear program none of the optimal solutions have to 
be basic, as is illustrated by the linear program 


maximize $x_1 + x_2$ subject to $x_1 + x_2 \le 1$.


This contrasts with the situation for the equational form (cf. Theo- 
rem 4.2.3) and it is one of the advantages of the equational form. 
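
Definition 4.4.2 can be checked mechanically for small examples. The Python sketch below is a minimal illustration under stated assumptions: constraints are stored as triples (coefficients, relation, right-hand side), the helper names are hypothetical, and the rank is computed by Gaussian elimination over the rationals. It confirms that the linear program above has no basic feasible solution at all, while adding the nonnegativity constraints $x_1, x_2 \ge 0$ (an assumption made only for comparison) produces basic feasible solutions such as $(1, 0)$.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a list of vectors, by Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def is_basic_feasible(constraints, x):
    """Check Definition 4.4.2: x is feasible and some n linearly independent
    constraints hold with equality at x."""
    tight = []
    for coeffs, relation, rhs in constraints:
        lhs = sum(Fraction(a) * Fraction(v) for a, v in zip(coeffs, x))
        if relation == "<=" and lhs > rhs:
            return False
        if relation == ">=" and lhs < rhs:
            return False
        if relation == "==" and lhs != rhs:
            return False
        if lhs == rhs:
            tight.append(coeffs)
    return rank(tight) == len(x)


# The program from the text: a single constraint x1 + x2 <= 1.
single = [((1, 1), "<=", 1)]
# The same program with nonnegativity constraints added for comparison.
with_nonneg = single + [((1, 0), ">=", 0), ((0, 1), ">=", 0)]

half = Fraction(1, 2)
print(is_basic_feasible(single, (half, half)))   # optimal, but not basic
print(is_basic_feasible(single, (1, 0)))         # only one tight constraint
print(is_basic_feasible(with_nonneg, (1, 0)))    # two independent tight constraints
```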


Vertices and extremal points. The intuitive notion of a “tip” of 
a convex set can be viewed mathematically in at least two ways. One 
of them is captured by the above definition of a vertex of a convex 
polyhedron: A tip is a point for which some linear function attains a 
unique maximum. The other one leads to a definition talking about 
points that cannot be “generated by segments.” These are called extremal
points; thus a point $x$ is an extremal point of a convex set
$C \subseteq \mathbb{R}^n$ if $x \in C$ and there are no two points $y, z \in C$ different from $x$
such that $x$ lies on the segment $yz$.

For a convex polyhedron it is not difficult to show that the extremal 
points are exactly the vertices. Hence we have yet another equivalent 
description of a basic feasible solution. 


A convex polytope is the convex hull of its vertices. A gen- 
eral convex polyhedron need not have any vertices at all—consider 
a half-space. However, a convex polytope P, i.e., a bounded convex 
polyhedron, always has vertices, and even more is true: P equals the 
convex hull of the set of its vertices. This may look intuitively obvi- 
ous from examples in dimensions 2 and 3, but a proof is nontrivial 
(Ziegler’s book cited in the previous section calls this the “Main Theorem”
of polytope theory). Consequently, every convex polytope can
be represented either as the intersection of finitely many half-spaces 
or as the convex hull of finitely many points. 


5. The Simplex Method 


In this chapter we explain the simplex method for solving linear programs. 
We will make use of the terms equational form and basic feasible solution 
from the previous chapter. 

Gaussian elimination in linear algebra has a fundamental theoretical and 
didactic significance, as a starting point for further developments. But in 
practice it has mostly been replaced by more complicated and more efficient 
algorithms. Similarly, the basic version of the simplex method that we discuss 
here is not commonly used for solving linear programs in practice. We do not 
put emphasis on the most efficient possible organization of the computations, 
but rather we concentrate on the main ideas. 


5.1 An Introductory Example 


We will first show the simplex method in action on a small concrete example, 
namely, on the following linear program: 


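$$\begin{array}{lrcl}
\text{maximize}   & x_1 + x_2 \\
\text{subject to} & -x_1 + x_2 &\le& 1 \\
                  & x_1 &\le& 3 \\
                  & x_2 &\le& 2 \\
                  & x_1, x_2 &\ge& 0.
\end{array} \qquad (5.1)$$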


We intentionally do not take a linear program in equational form: The vari- 
ables are nonnegative, but the inequalities have to be replaced by equations, 
by introducing slack variables. The equational form is 


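$$\begin{array}{lrcl}
\text{maximize}   & x_1 + x_2 \\
\text{subject to} & -x_1 + x_2 + x_3 &=& 1 \\
                  & x_1 + x_4 &=& 3 \\
                  & x_2 + x_5 &=& 2 \\
                  & x_1, x_2, \ldots, x_5 &\ge& 0,
\end{array}$$


with the matrix

$$A = \begin{pmatrix} -1 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \end{pmatrix}
\quad\text{and the vector}\quad b = \begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix}.$$

In the simplex method we first express each linear program in the form
of a simplex tableau. In our case we begin with the tableau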


$$\begin{array}{rcl}
x_3 &=& 1 + x_1 - x_2 \\
x_4 &=& 3 - x_1 \\
x_5 &=& 2 - x_2 \\ \hline
z &=& x_1 + x_2
\end{array}$$


The first three rows consist of the equations of the linear program, in which 
the slack variables have been carried over to the left-hand side and the re- 
maining terms are on the right-hand side. The last row, separated by a line, 
contains a new variable z, which expresses the objective function. 

Each simplex tableau is associated with a certain basic feasible solution. 
In our case we substitute 0 for the variables $x_1$ and $x_2$ from the right-hand
side, and without calculation we see that $x_3 = 1$, $x_4 = 3$, $x_5 = 2$. This feasible
solution is indeed basic with $B = \{3, 4, 5\}$; we note that $A_B$ is the identity
matrix. The variables $x_3, x_4, x_5$ from the left-hand side are basic and the
variables $x_1, x_2$ from the right-hand side are nonbasic. The value of the objective
function $z = 0$ corresponding to this basic feasible solution can be read off
from the last row of the tableau. 

From the initial simplex tableau we will construct a sequence of tableaus of 
a similar form, by gradually rewriting them according to certain rules. Each 
tableau will contain the same information about the linear program, only 
written differently. The procedure terminates with a tableau that represents 
the information so that the desired optimal solution can be read off directly. 

Let us go to the first step. We try to increase the value of the objective 
function by increasing one of the nonbasic variables $x_1$ or $x_2$. In the above
tableau we observe that increasing the value of $x_1$ (i.e., making $x_1$ positive)
increases the value of $z$. The same is true for $x_2$, because both variables have
positive coefficients in the $z$-row of the tableau. We can choose either $x_1$
or $x_2$; let us decide (arbitrarily) for $x_2$. We will increase it, while $x_1$ will stay
0.


By how much can we increase $x_2$? If we want to maintain feasibility, we
have to be careful not to let any of the basic variables $x_3, x_4, x_5$ go below zero.
This means that the equations determining $x_3, x_4, x_5$ may limit the increment
of $x_2$. Let us consider the first equation


$$x_3 = 1 + x_1 - x_2.$$


Together with the implicit constraint $x_3 \ge 0$ it lets us increase $x_2$ up to the
value $x_2 = 1$ (while keeping $x_1 = 0$). The second equation


$$x_4 = 3 - x_1$$


does not limit the increment of $x_2$ at all, and the third equation


$$x_5 = 2 - x_2$$


allows for an increase of $x_2$ up to $x_2 = 2$ before $x_5$ gets negative. The most
stringent restriction thus follows from the first equation.
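
The comparison of the three equations can be written as a small routine. The following Python sketch assumes a hypothetical representation in which each row of the tableau is stored as (constant, {nonbasic variable: coefficient}); it computes, for a chosen entering variable, which row is the most stringent and how far the variable can be increased. Only rows in which the entering variable has a negative coefficient impose a limit.

```python
from fractions import Fraction

# Rows of the initial tableau: basic variable -> (constant, coefficients).
tableau = {
    "x3": (1, {"x1": 1, "x2": -1}),   # x3 = 1 + x1 - x2
    "x4": (3, {"x1": -1}),            # x4 = 3 - x1
    "x5": (2, {"x2": -1}),            # x5 = 2 - x2
}


def ratio_test(tableau, entering):
    """Return (leaving variable, largest admissible value of the entering variable).
    Both entries are None if no row restricts the increase."""
    best_var, best_bound = None, None
    for basic, (const, coeffs) in tableau.items():
        coefficient = coeffs.get(entering, 0)
        if coefficient < 0:
            # The basic variable reaches zero when entering = const / (-coefficient).
            bound = Fraction(const, -coefficient)
            if best_bound is None or bound < best_bound:
                best_var, best_bound = basic, bound
    return best_var, best_bound


print(ratio_test(tableau, "x2"))
print(ratio_test(tableau, "x1"))
```

For the entering variable $x_2$ the routine selects the first row with bound 1, in agreement with the discussion above; for $x_1$ it would select the second row with bound 3.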

We increase $x_2$ as much as we can, obtaining $x_2 = 1$ and $x_3 = 0$. From the
remaining equations of the tableau we get the values of the other variables:


$$\begin{array}{rcl}
x_4 &=& 3 - x_1 = 3 \\
x_5 &=& 2 - x_2 = 1.
\end{array}$$


In this new feasible solution x3 became zero and x2 nonzero. Quite natu- 
rally we thus transfer x3 to the right-hand side, where the nonbasic variables 
live, and x2 to the left-hand side, where the basic variables reside. We do it 
by means of the most stringent equation $x_3 = 1 + x_1 - x_2$, from which we
express

$$x_2 = 1 + x_1 - x_3.$$

We substitute the right-hand side for $x_2$ into the remaining equations, and
we arrive at a new tableau:


$$\begin{array}{rcl}
x_2 &=& 1 + x_1 - x_3 \\
x_4 &=& 3 - x_1 \\
x_5 &=& 1 - x_1 + x_3 \\ \hline
z &=& 1 + 2x_1 - x_3
\end{array}$$


Here B = {2,4,5}, which corresponds to the basic feasible solution x = 
(0,1,0,3,1) with the value of the objective function z = 1. 

This process of rewriting one simplex tableau into another is called a 
pivot step. In each pivot step some nonbasic variable, in our case x2, enters 
the basis, while some basic variable, in our case x3, leaves the basis. 
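
A pivot step is just this solve-and-substitute operation, and it can be sketched in a few lines of Python. The representation below (each row stored as a constant plus a dictionary of coefficients of the nonbasic variables) and the function name `pivot` are assumptions made for illustration; this is not an efficient implementation. Applied to the initial tableau of the example with $x_2$ entering and $x_3$ leaving, it reproduces the second tableau.

```python
from fractions import Fraction


def pivot(rows, z_row, entering, leaving):
    """One pivot step. rows[basic] = (constant, {nonbasic: coefficient});
    z_row has the same shape and describes the objective function."""
    const, coeffs = rows[leaving]
    a = Fraction(coeffs[entering])          # must be nonzero
    # Solve the leaving row for the entering variable.
    new_const = -Fraction(const) / a
    new_coeffs = {v: -Fraction(c) / a for v, c in coeffs.items() if v != entering}
    new_coeffs[leaving] = 1 / a

    def substitute(row):
        """Replace the entering variable in a row by its new expression."""
        k, d = row
        e = Fraction(d.get(entering, 0))
        out = {v: Fraction(c) for v, c in d.items() if v != entering}
        for v, c in new_coeffs.items():
            out[v] = out.get(v, Fraction(0)) + e * c
        return (Fraction(k) + e * new_const, out)

    new_rows = {b: substitute(r) for b, r in rows.items() if b != leaving}
    new_rows[entering] = (new_const, new_coeffs)
    return new_rows, substitute(z_row)


rows0 = {
    "x3": (1, {"x1": 1, "x2": -1}),   # x3 = 1 + x1 - x2
    "x4": (3, {"x1": -1}),            # x4 = 3 - x1
    "x5": (2, {"x2": -1}),            # x5 = 2 - x2
}
z0 = (0, {"x1": 1, "x2": 1})          # z = x1 + x2

rows1, z1 = pivot(rows0, z0, entering="x2", leaving="x3")
for name, (k, d) in sorted(rows1.items()):
    print(name, "=", k, d)
print("z =", z1)
```

Variables that drop out of a row appear with coefficient 0 in this representation, which keeps the code short at the cost of a slightly redundant output.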

In the new tableau we can further increase the value of the objective
function by increasing $x_1$, while increasing $x_3$ would lead to a smaller $z$-value.
The first equation does not restrict the increment of $x_1$ in any way, from the
second one we get $x_1 \le 3$, and from the third one $x_1 \le 1$, so the strictest
limitation is implied by the third equation. Similarly as in the previous step,
we express $x_1$ from it and we substitute this expression into the remaining
equations. Thereby $x_1$ enters the basis and moves to the left-hand side, and
$x_5$ leaves the basis and migrates to the right-hand side. The tableau we obtain
is


$$\begin{array}{rcl}
x_1 &=& 1 + x_3 - x_5 \\
x_2 &=& 2 - x_5 \\
x_4 &=& 2 - x_3 + x_5 \\ \hline
z &=& 3 + x_3 - 2x_5
\end{array}$$


with $B = \{1, 2, 4\}$, basic feasible solution $x = (1, 2, 0, 2, 0)$, and $z = 3$. After
one more pivot step, in which $x_3$ enters the basis and $x_4$ leaves it, we arrive
at the tableau


$$\begin{array}{rcl}
x_1 &=& 3 - x_4 \\
x_2 &=& 2 - x_5 \\
x_3 &=& 2 - x_4 + x_5 \\ \hline
z &=& 5 - x_4 - x_5
\end{array}$$


with basis $\{1, 2, 3\}$, basic feasible solution $x = (3, 2, 2, 0, 0)$, and $z = 5$. In this
tableau, no nonbasic variable can be increased without making the objective 
function value smaller, so we are stuck. Luckily, this also means that we have 
already found an optimal solution! Why? 

Let us consider an arbitrary feasible solution $\tilde{x} = (\tilde{x}_1, \ldots, \tilde{x}_5)$ of our
linear program, with the objective function attaining some value $\tilde{z}$. Now $\tilde{x}$
and $\tilde{z}$ satisfy all equations in the final tableau, which was obtained from the
original equations of the linear program by equivalent transformations. Hence
we necessarily have

$$\tilde{z} = 5 - \tilde{x}_4 - \tilde{x}_5.$$


Together with the nonnegativity constraints $\tilde{x}_4, \tilde{x}_5 \ge 0$ this implies $\tilde{z} \le 5$.
The tableau even delivers a proof that $x = (3, 2, 2, 0, 0)$ is the only optimal
solution: If $z = 5$, then $x_4 = x_5 = 0$, and this determines the values of the
remaining variables uniquely. 

A geometric illustration. For each feasible solution $(x_1, x_2)$ of the original
linear program (5.1) with inequalities we have exactly one corresponding
feasible solution $(x_1, x_2, \ldots, x_5)$ of the modified linear program in equational
form, and conversely. The sets of feasible solutions are isomorphic in a suitable
sense, and we can thus follow the progress of the simplex method narrated
above in a planar picture for the original linear program (5.1):


[Figure: the feasible region of (5.1) in the plane, with the constraints
$x_1 \le 3$, $-x_1 + x_2 \le 1$, $x_2 \le 2$, $x_2 \ge 0$ labeled and the points
$(0,0)$ and $(3,2)$ marked.]


We can see the simplex method moving along the edges from one feasible 
solution to another, while the value of the objective function grows until it 
reaches the optimum. In the example we could also take a shorter route if we 
decided to increase $x_1$ instead of $x_2$ in the first step.


Potential troubles. In our modest example the simplex method has run
smoothly without any problems. In general we must deal with several
complications. We will demonstrate them on examples in the next sections.


5.2 Exception Handling: Unboundedness


What happens in the simplex method for an unbounded linear program? We 
will show it on the example 


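$$\begin{array}{lrcl}
\text{maximize}   & x_1 \\
\text{subject to} & x_1 - x_2 &\le& 1 \\
                  & -x_1 + x_2 &\le& 2 \\
                  & x_1, x_2 &\ge& 0
\end{array}$$

illustrated in the picture below:

[Figure: the unbounded feasible region in the plane, with the constraints
$x_1 \ge 0$, $-x_1 + x_2 \le 2$, $x_2 \ge 0$, $x_1 - x_2 \le 1$ labeled.]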

After the usual transformation to equational form by introducing slack
variables $x_3, x_4$, we can use these variables as a feasible basis and we obtain the
initial simplex tableau


$$\begin{array}{rcl}
x_3 &=& 1 - x_1 + x_2 \\
x_4 &=& 2 + x_1 - x_2 \\ \hline
z &=& x_1
\end{array}$$

After the first pivot step with entering variable $x_1$ and leaving variable $x_3$
the next tableau is


$$\begin{array}{rcl}
x_1 &=& 1 + x_2 - x_3 \\
x_4 &=& 3 - x_3 \\ \hline
z &=& 1 + x_2 - x_3
\end{array}$$


If we now try to introduce x2 into the basis, we discover that none of the 
equations in the tableau restrict its increase in any way. We can thus take 
x2 arbitrarily large, and we also get z arbitrarily large—the linear program 
is unbounded. 

Let us analyze this situation in more detail. From the tableau one can see 
that for an arbitrarily large number t > 0 we obtain a feasible solution by 
setting $x_2 = t$, $x_3 = 0$, $x_1 = 1 + t$, and $x_4 = 3$, with the value of the objective
function $z = 1 + t$. In other words, the semi-infinite ray


$$\{(1, 0, 0, 3) + t(1, 1, 0, 0) : t \ge 0\}$$

is contained in the set of feasible solutions. It “witnesses” the unboundedness
of the linear program, since the objective function attains arbitrarily
large values on it. The corresponding semi-infinite ray for the original two- 
dimensional linear program is drawn thick in the picture above. 

A similar ray is the output of the simplex method for all unbounded linear 
programs. 
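
That the ray from the example really is a witness of unboundedness can be verified directly. The short Python sketch below (with illustrative helper names) checks the three properties that matter for the equational form of the example: the starting point is feasible, the direction $d = (1, 1, 0, 0)$ satisfies $Ad = 0$ and $d \ge 0$, and the objective function strictly increases along $d$.

```python
# Equational form of the example: x1 - x2 + x3 = 1, -x1 + x2 + x4 = 2, x >= 0.
A = [[1, -1, 1, 0],
     [-1, 1, 0, 1]]
b = [1, 2]
c = [1, 0, 0, 0]          # objective: maximize x1

base = (1, 0, 0, 3)       # starting point of the ray
direction = (1, 1, 0, 0)  # direction of the ray


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def point_on_ray(t):
    return tuple(p + t * d for p, d in zip(base, direction))


def is_feasible(x):
    return all(dot(row, x) == bi for row, bi in zip(A, b)) and all(xi >= 0 for xi in x)


def is_unboundedness_witness():
    """The base point is feasible, A*d = 0, d >= 0, and c.d > 0."""
    return (is_feasible(base)
            and all(dot(row, direction) == 0 for row in A)
            and all(d >= 0 for d in direction)
            and dot(c, direction) > 0)


print(is_unboundedness_witness())
print([dot(c, point_on_ray(t)) for t in (0, 1, 10, 1000)])
```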


5.3 Exception Handling: Degeneracy 


While we can make some nonbasic variable arbitrarily large in the unbounded 
case, the other extreme happens in a situation called a degeneracy: The equa- 
tions in a tableau do not permit any increment of the selected nonbasic vari- 
able, and it may actually be impossible to increase the objective function z 
in a single pivot step. 

Let us consider the linear program 


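$$\begin{array}{lrcl}
\text{maximize}   & x_2 \\
\text{subject to} & -x_1 + x_2 &\le& 0 \\
                  & x_1 &\le& 2 \\
                  & x_1, x_2 &\ge& 0.
\end{array} \qquad (5.2)$$

[Figure: the feasible region of (5.2) in the plane, with the constraints
$x_1 \ge 0$, $x_1 \le 2$, $-x_1 + x_2 \le 0$, $x_2 \ge 0$ labeled.]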

In the usual way we convert it to equational form and construct the initial 
tableau 


$$\begin{array}{rcl}
x_3 &=& x_1 - x_2 \\
x_4 &=& 2 - x_1 \\ \hline
z &=& x_2
\end{array}$$


The only candidate for entering the basis is x2, but the first row of the 
tableau shows that its value cannot be increased without making $x_3$ negative.
Unfortunately, the impossibility of making progress in this case does not
imply optimality, so we have to perform a degenerate pivot step, i.e., one
with zero progress in the objective function. In our example, bringing $x_2$ into
the basis (with $x_3$ leaving) results in another tableau with the same basic
feasible solution $(0, 0, 0, 2)$:


$$\begin{array}{rcl}
x_2 &=& x_1 - x_3 \\
x_4 &=& 2 - x_1 \\ \hline
z &=& x_1 - x_3
\end{array}$$


Nevertheless, the situation has improved. The nonbasic variable $x_1$ can now
be increased, and by entering it into the basis (replacing $x_4$) we already obtain
the final tableau


$$\begin{array}{rcl}
x_1 &=& 2 - x_4 \\
x_2 &=& 2 - x_3 - x_4 \\ \hline
z &=& 2 - x_3 - x_4
\end{array}$$


with an optimal solution $x = (2, 2, 0, 0)$.

A situation that forces a degenerate pivot step may occur only for a linear 
program in which several feasible bases correspond to a single basic feasible 
solution. Such linear programs are called degenerate. 

It is easily seen that in order that a single basic feasible solution be ob- 
tained from several bases, some of the basic variables have to be zero. 
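
For the small example (5.2) all bases can simply be enumerated. The following Python sketch is a brute-force illustration, not part of the simplex method: it assumes the equational form with slack variables $x_3, x_4$, uses a hypothetical helper `solve` based on Gauss-Jordan elimination over the rationals, and groups the feasible bases by the basic feasible solution they determine.

```python
from fractions import Fraction
from itertools import combinations


def solve(M, rhs):
    """Solve M y = rhs by Gauss-Jordan elimination over the rationals.
    Returns None if the square matrix M is singular."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        pivot = next((i for i in range(col, n) if aug[i][col] != 0), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        d = aug[col][col]
        aug[col] = [v / d for v in aug[col]]
        for i in range(n):
            if i != col and aug[i][col] != 0:
                f = aug[i][col]
                aug[i] = [a - f * p for a, p in zip(aug[i], aug[col])]
    return [row[-1] for row in aug]


# Equational form of (5.2): -x1 + x2 + x3 = 0, x1 + x4 = 2, x >= 0.
A = [[-1, 1, 1, 0],
     [1, 0, 0, 1]]
b = [0, 2]
m, n = 2, 4

bases_by_solution = {}
for B in combinations(range(n), m):
    A_B = [[row[j] for j in B] for row in A]
    x_B = solve(A_B, b)
    if x_B is None or any(v < 0 for v in x_B):
        continue  # singular, or basic solution not feasible
    x = [Fraction(0)] * n
    for j, v in zip(B, x_B):
        x[j] = v
    # Store the basis with 1-based indices, as in the text.
    bases_by_solution.setdefault(tuple(x), []).append({j + 1 for j in B})

for x, bases in bases_by_solution.items():
    print(tuple(int(v) for v in x), bases)
```

In this enumeration the solution $(0, 0, 0, 2)$ is obtained from three different feasible bases, and in each of them one basic variable is zero, in line with the remark above.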

In this example, after one degenerate pivot step we could again make 
progress. In general, there might be longer runs of degenerate pivot steps. It 
may even happen that some tableau is repeated in a sequence of degenerate 
pivot steps, and so the algorithm might pass through an infinite sequence 
of tableaus without any progress. This phenomenon is called cycling. An 
example of a linear program for which the simplex method may cycle can be 
found in Chvátal’s textbook cited in Chapter 9 (the smallest possible example
has 6 variables and 3 equations), and we will not present it here. 

If the simplex method doesn’t cycle, then it necessarily finishes in a finite 
number of steps. This is because there are only finitely many possible simplex 
tableaus for any given linear program, namely at most $\binom{n}{m}$, which we will
prove in Section 5.5. 

How can cycling be prevented? This is a nontrivial issue and it will be 
discussed in Section 5.8. 


5.4 Exception Handling: Infeasibility 


In order that the simplex method be able to start at all, we need a feasible 
basis. In examples discussed up until now we got a feasible basis more or less 
for free. It works this way for all linear programs of the form 


maximize $c^T x$ subject to $Ax \le b$ and $x \ge 0$


with $b \ge 0$. Indeed, the indices of the slack variables introduced in the
transformation to equational form can serve as a feasible basis.

However, in general, finding any feasible solution of a linear program is
as difficult as finding an optimal solution (see the remark in Section 1.3).
But computing the initial feasible basis can be done by the simplex
method itself, if we apply it to a suitable auxiliary problem.

Let us consider the linear program in equational form 


$$\begin{array}{lrcl}
\text{maximize}   & x_1 + 2x_2 \\
\text{subject to} & x_1 + 3x_2 + x_3 &=& 4 \\
                  & 2x_2 + x_3 &=& 2 \\
                  & x_1, x_2, x_3 &\ge& 0.
\end{array}$$


Let us try to produce a feasible solution starting with $(x_1, x_2, x_3) = (0, 0, 0)$.
This vector is nonnegative, but of course it is not feasible, since it does not
satisfy the equations of the linear program. We introduce auxiliary variables
$x_4$ and $x_5$ as “corrections” of infeasibility: $x_4 = 4 - x_1 - 3x_2 - x_3$ expresses by
how much the original variables $x_1, x_2, x_3$ fail to satisfy the first equation, and
$x_5 = 2 - 2x_2 - x_3$ plays a similar role for the second equation. If we managed
to find nonnegative values of $x_1, x_2, x_3$ for which both of these corrections
come out as zeros, we would have a feasible solution of the considered linear
program.

The task of finding nonnegative $x_1, x_2, x_3$ with zero corrections can be
captured by a linear program:


$$\begin{array}{lrcl}
\text{maximize}   & -x_4 - x_5 \\
\text{subject to} & x_1 + 3x_2 + x_3 + x_4 &=& 4 \\
                  & 2x_2 + x_3 + x_5 &=& 2 \\
                  & x_1, x_2, \ldots, x_5 &\ge& 0.
\end{array}$$


The optimal value of the objective function $-x_4 - x_5$ is 0 exactly if there
exist values of $x_1, x_2, x_3$ with zero corrections, i.e., a feasible solution of the
original linear program.

This is the right auxiliary linear program. The variables $x_4$ and $x_5$ form
a feasible basis, with the basic feasible solution $(0, 0, 0, 4, 2)$. (Here we use
that the right-hand sides, 4 and 2, are nonnegative, but since we deal with
equations, this can always be achieved by sign changes.) Once we express
the objective function using the nonbasic variables, that is, in the form
$z = -6 + x_1 + 5x_2 + 2x_3$, we can start the simplex method on the auxiliary linear
program.
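
The expression for $z$ comes from substituting the corrections into $-x_4 - x_5$. As a quick check, the Python sketch below (the function name `auxiliary_objective` is hypothetical) computes the constant term and the coefficients for a general system $Ax = b$ with $b \ge 0$: since the $i$th auxiliary variable equals $b_i - \sum_j a_{ij} x_j$, the objective is $-\sum_i b_i + \sum_j \bigl(\sum_i a_{ij}\bigr) x_j$.

```python
def auxiliary_objective(A, b):
    """Express z = -(sum of auxiliary variables) through the original variables.
    Returns (constant term, list of coefficients of x_1, ..., x_n)."""
    constant = -sum(b)
    # The coefficient of x_j is the sum of the j-th column of A.
    coefficients = [sum(row[j] for row in A) for j in range(len(A[0]))]
    return constant, coefficients


# The system from the text: x1 + 3*x2 + x3 = 4 and 2*x2 + x3 = 2.
A = [[1, 3, 1],
     [0, 2, 1]]
b = [4, 2]

constant, coefficients = auxiliary_objective(A, b)
print(constant, coefficients)   # z = -6 + x1 + 5*x2 + 2*x3
```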

The auxiliary linear program is surely bounded, since the objective func- 
tion cannot be positive. The simplex method thus computes a basic feasible 
solution that is optimal. 

As an exercise, the reader can check that if we let $x_1$ enter the basis in the
first pivot step and $x_3$ in the second, the final simplex tableau comes out as


$$\begin{array}{rcl}
x_1 &=& 2 - x_2 - x_4 + x_5 \\
x_3 &=& 2 - 2x_2 - x_5 \\ \hline
z &=& -x_4 - x_5.
\end{array}$$

The corresponding optimal solution $(2, 0, 2, 0, 0)$ yields a basic feasible solution
of the original linear program: $(x_1, x_2, x_3) = (2, 0, 2)$. The initial simplex
tableau for the original linear program can even be obtained from the final
tableau of the auxiliary linear program, by leaving out the columns of the
auxiliary variables $x_4$ and $x_5$,¹ and by changing the objective function back
to the original one, expressed in terms of the nonbasic variables:


$$\begin{array}{rcl}
x_1 &=& 2 - x_2 \\
x_3 &=& 2 - 2x_2 \\ \hline
z &=& 2 + x_2
\end{array}$$


Starting from this tableau, a single pivot step already reaches the optimum. 


5.5 Simplex Tableaus in General 


In this section and the next one we formulate in general, and mostly with 
proofs, what has previously been explained on examples. 
Let us consider a general linear program in equational form 


    maximize $c^T x$ subject to $Ax = b$ and $x \ge 0$.


The simplex method applied to it computes a sequence of simplex tableaus. 
Each of them corresponds to a feasible basis B and it determines a basic 
feasible solution, as we will soon verify. (Let us recall that a feasible basis is
an $m$-element set $B \subseteq \{1,2,\ldots,n\}$ such that the matrix $A_B$ is nonsingular
and the (unique) solution of the system $A_B x_B = b$ is nonnegative.)

Formally, we will define a simplex tableau as a certain system of linear 
equations of a special form, in which the basic variables and the variable z, 
representing the value of the objective function, stand on the left-hand side 
and they are expressed in terms of the nonbasic variables. 


A simplex tableau $T(B)$ determined by a feasible basis $B$ is a system
of $m+1$ linear equations in variables $x_1, x_2, \ldots, x_n$, and $z$ that has the
same set of solutions as the system $Ax = b$, $z = c^T x$, and in matrix
notation looks as follows:

    $x_B = p + Q x_N$
    $z = z_0 + r^T x_N$

where $x_B$ is the vector of the basic variables, $N = \{1,2,\ldots,n\} \setminus B$, $x_N$ is
the vector of nonbasic variables, $p \in \mathbb{R}^m$, $r \in \mathbb{R}^{n-m}$, $Q$ is an $m \times (n-m)$
matrix, and $z_0 \in \mathbb{R}$.


¹ It may happen that some auxiliary variables are zero but still basic in the final
tableau of the auxiliary program, and so they cannot simply be left out. Section
5.6 discusses this (easy) issue.

The basic feasible solution corresponding to this tableau can be read off
immediately: It is obtained by substituting $x_N = 0$; that is, we have $x_B = p$.
From the feasibility of the basis $B$ we see that $p \ge 0$. The objective function
for this basic feasible solution has value $z_0 + r^T 0 = z_0$.

The values of $p$, $Q$, $r$, $z_0$ can easily be expressed using $B$ and $A$, $b$, $c$:


5.5.1 Lemma. For each feasible basis $B$ there exists exactly one simplex
tableau, and it is given by

    $Q = -A_B^{-1} A_N$, $p = A_B^{-1} b$, $z_0 = c_B^T A_B^{-1} b$, and $r = c_N - (c_B^T A_B^{-1} A_N)^T$.


It is neither necessary nor very useful to remember these formulas; they
are easily rederived if needed. The proof is not very exciting, so we write it
more concisely than other parts of the text and leave some details to the
diligent reader. We will proceed similarly with subsequent proofs of a similar
kind.


Proof. First let us see how these formulas can be discovered: We
rewrite the system $Ax = b$ to $A_B x_B = b - A_N x_N$, and we multiply it
by the inverse matrix $A_B^{-1}$ from the left (these transformations preserve
the solution set), which leads to

    $x_B = A_B^{-1} b - A_B^{-1} A_N x_N.$

We substitute the right-hand side for $x_B$ into the equation
$z = c^T x = c_B^T x_B + c_N^T x_N$, and we obtain

    $z = c_B^T (A_B^{-1} b - A_B^{-1} A_N x_N) + c_N^T x_N$
    $\phantom{z} = c_B^T A_B^{-1} b + (c_N^T - c_B^T A_B^{-1} A_N) x_N.$

Thus the formulas in the lemma do yield a simplex tableau, and it
remains to verify the uniqueness.

Let $p, Q, r, z_0$ determine a simplex tableau for a feasible basis $B$,
and let $p', Q', r', z_0'$ do as well. Since each choice of $x_N$ determines
$x_B$ uniquely, the equality $p + Q x_N = p' + Q' x_N$ has to hold for all
$x_N \in \mathbb{R}^{n-m}$. The choice $x_N = 0$ gives $p = p'$, and if we substitute the
unit vectors $e_j$ of the standard basis for $x_N$ one by one, we also get
$Q = Q'$. The equalities $z_0 = z_0'$ and $r = r'$ are proved similarly.
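
The formulas of Lemma 5.5.1 can be checked numerically on the auxiliary linear program of Section 5.4. The sketch below is a minimal illustration under stated assumptions: the helper names `solve` and `tableau`, the 0-based column indices, and the use of Gauss-Jordan elimination over exact fractions are choices made here, not part of the text. It computes $p$, $Q$, $z_0$, $r$ for the basis $\{1, 3\}$ reached at the end of that example.

```python
from fractions import Fraction

def solve(M, R):
    """Return M^{-1} R by Gauss-Jordan elimination (M is assumed nonsingular)."""
    m = len(M)
    aug = [[Fraction(v) for v in M[i]] + [Fraction(v) for v in R[i]]
           for i in range(m)]
    for col in range(m):
        piv = next(i for i in range(col, m) if aug[i][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for i in range(m):
            if i != col:
                f = aug[i][col]
                aug[i] = [v - f * w for v, w in zip(aug[i], aug[col])]
    return [row[m:] for row in aug]

def tableau(A, b, c, B):
    """Parameters (p, Q, z0, r) of the simplex tableau for the basis B.

    B is a sorted list of 0-based column indices.
    """
    m, n = len(A), len(c)
    N = [j for j in range(n) if j not in B]
    A_B = [[A[i][j] for j in B] for i in range(m)]
    # A single elimination yields both A_B^{-1} A_N and A_B^{-1} b.
    X = solve(A_B, [[A[i][j] for j in N] + [b[i]] for i in range(m)])
    p = [row[-1] for row in X]                      # p = A_B^{-1} b
    Q = [[-v for v in row[:-1]] for row in X]       # Q = -A_B^{-1} A_N
    z0 = sum(c[B[i]] * p[i] for i in range(m))      # z0 = c_B^T A_B^{-1} b
    # r = c_N - (c_B^T A_B^{-1} A_N)^T = c_N + Q^T c_B
    r = [c[N[k]] + sum(c[B[i]] * Q[i][k] for i in range(m))
         for k in range(len(N))]
    return p, Q, z0, r

# Auxiliary linear program of Section 5.4, variables x1, ..., x5.
A = [[1, 3, 1, 1, 0],
     [0, 2, 1, 0, 1]]
b = [4, 2]
c = [0, 0, 0, -1, -1]
p, Q, z0, r = tableau(A, b, c, B=[0, 2])   # basis {x1, x3}
print(p, Q, z0, r)
```

The rows of Q list the coefficients of the nonbasic variables x2, x4, x5 in this order, so the output can be compared directly with the final tableau of the auxiliary program.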


5.6 The Simplex Method in General 


Optimality. Exactly as in the concrete example in Section 5.1, we have the
following criterion of optimality of a simplex tableau:

If $T(B)$ is a simplex tableau such that the coefficients of the nonbasic
variables are nonpositive in the last row, i.e., if

    $r \le 0$,

then the corresponding basic feasible solution is optimal.


Indeed, the basic feasible solution corresponding to such a tableau has the
objective function equal to $z_0$, while for any other feasible solution $\tilde{x}$ we have
$\tilde{x}_N \ge 0$ and $c^T \tilde{x} = z_0 + r^T \tilde{x}_N \le z_0$.

A pivot step: who enters and who leaves. In each step of the simplex
method we go from an “old” basis $B$ and simplex tableau $T(B)$ to a “new”
basis $B'$ and the corresponding simplex tableau $T(B')$. A nonbasic variable
$x_v$ enters the basis and a basic variable $x_u$ leaves the basis,² and hence
$B' = (B \setminus \{u\}) \cup \{v\}$.


We always select the entering variable $x_v$ first.


A nonbasic variable may enter the basis if and only if its coefficient in
the last row of the simplex tableau is positive.


Only incrementing such nonbasic variables increases the value of the objective 
function. 

Usually there are several positive coefficients in the last row, and hence 
several possible choices of the entering variable. For the time being the reader 
may think of this choice as arbitrary. We will discuss ways of selecting one of 
these possibilities in Section 5.7. 

Once we decide that the entering variable is some $x_v$, it remains to pick
the leaving variable.


The leaving variable $x_u$ has to be such that its nonnegativity, together
with the corresponding equation in the simplex tableau having $x_u$ on
the left-hand side, limits the increment of the entering variable $x_v$ most
strictly.


Expressed by a formula, this condition might look complicated because
of some double indices, but the idea is simple and we have already seen it
in examples. Let us write $B = \{k_1, k_2, \ldots, k_m\}$, $k_1 < k_2 < \cdots < k_m$, and
$N = \{\ell_1, \ell_2, \ldots, \ell_{n-m}\}$, $\ell_1 < \ell_2 < \cdots < \ell_{n-m}$. Then the $i$th equation of the
simplex tableau has the form

    $x_{k_i} = p_i + \sum_{j=1}^{n-m} q_{ij} x_{\ell_j}.$


² The letters $u$ and $v$ do not denote vectors here (the alphabet is not that long,
after all).

We now want to write the index $v$ of the chosen entering variable as $v = \ell_\beta$.
In more detail, we define $\beta \in \{1,2,\ldots,n-m\}$ as the index for which $v = \ell_\beta$.
Similarly, the index $u$ of the leaving variable (which hasn’t been selected yet)
will be written in the form $u = k_\alpha$.

Since all nonbasic variables $x_{\ell_j}$, $j \ne \beta$, should remain zero, the nonnegativity
condition $x_{k_i} \ge 0$ limits the possible values of the entering variable
$x_{\ell_\beta}$ by the inequality $-q_{i\beta} x_{\ell_\beta} \le p_i$. If $q_{i\beta} \ge 0$, then this inequality doesn’t
restrict the increase of $x_{\ell_\beta}$ in any way, while for $q_{i\beta} < 0$ it yields the restriction
$x_{\ell_\beta} \le -p_i / q_{i\beta}$.

The leaving variable $x_{k_\alpha}$ is thus always such that

    $q_{\alpha\beta} < 0$ and $-\frac{p_\alpha}{q_{\alpha\beta}} = \min\left\{ -\frac{p_i}{q_{i\beta}} : q_{i\beta} < 0,\ i = 1, 2, \ldots, m \right\}$.    (5.3)

That is, in the simplex tableau we consider only the rows in which the coefficient
of $x_v$ is negative. In such rows we divide by this coefficient the component
of the vector $p$, we change sign, and we seek a minimum among these
ratios. If there is no row with a negative coefficient of $x_v$, i.e., the minimum
on the right-hand side of equation (5.3) is over an empty set, then the linear
program is unbounded and the computation finishes.
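
The ratio test (5.3) is short enough to write down directly. The following Python sketch is an illustration only; the function name `leaving_row` and the convention of returning `None` in the unbounded case are assumptions made here. Its arguments are the vector $p$ and the column of $Q$ belonging to the entering variable.

```python
from fractions import Fraction

def leaving_row(p, column):
    """Row index alpha selected by the ratio test (5.3).

    p[i] is the constant term of row i and column[i] is q_{i,beta}, the
    coefficient of the entering variable in row i. Returns None if no
    coefficient is negative, i.e., if the linear program is unbounded.
    """
    candidates = [i for i in range(len(p)) if column[i] < 0]
    if not candidates:
        return None
    # Minimize -p_i / q_{i,beta} over the rows with q_{i,beta} < 0.
    return min(candidates,
               key=lambda i: Fraction(-p[i]) / Fraction(column[i]))

# Rows x1 = 4 - 3x2 - x3 - x4 and x5 = 2 - 2x2 - x3, entering variable x3:
alpha = leaving_row([4, 2], [-1, -1])
print(alpha)   # 0-based row index; row 1 is the equation for x5
```

In case of ties, `min` returns the first minimizing row, so this sketch breaks ties by the smallest row index; other tie-breaking rules are equally admissible at this point.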

For a proof that the simplex method really goes through a sequence of 
feasible bases we need the following lemma. 


5.6.1 Lemma. If $B$ is a feasible basis and $T(B)$ is the corresponding simplex
tableau, and if the entering variable $x_v$ and the leaving variable $x_u$ have been
selected according to the criteria described above (and otherwise arbitrarily),
then $B' = (B \setminus \{u\}) \cup \{v\}$ is again a feasible basis.

If no $x_u$ satisfies the criterion for a leaving variable, then the linear program
is unbounded. For all $t \ge 0$ we obtain a feasible solution by substituting
$t$ for $x_v$ and 0 for all other nonbasic variables, and the value of the objective
function for these feasible solutions tends to infinity as $t \to \infty$.


The proof is one of those not essential for a basic understanding of the 
material. 


Proof (sketch). We first need to verify that the matrix $A_{B'}$ is nonsingular.
This holds exactly if $A_B^{-1} A_{B'}$ is nonsingular, since we assume
nonsingularity of $A_B$. The matrix $A_{B'}$ agrees with $A_B$ in $m-1$
columns corresponding to the basic variable indices $B \setminus \{u\}$. For the
basic variable with index $k_i$, $i \ne \alpha$, we get the unit vector $e_i$ in the
corresponding column of $A_B^{-1} A_{B'}$.

The negative of the remaining column of the matrix $A_B^{-1} A_{B'}$ occurs
in the simplex tableau $T(B)$ as the column of the entering variable $x_v$,
since $Q = -A_B^{-1} A_N$ by Lemma 5.5.1. There is a nonzero number $q_{\alpha\beta}$ in
row $\alpha$ corresponding to the leaving variable $x_{k_\alpha}$, since we have selected
$x_{k_\alpha}$ that way, and the other columns of $A_B^{-1} A_{B'}$ have 0 in that row.
Hence the matrix is nonsingular as claimed.

Next, we need to check feasibility of the basis $B'$. Here we use the
fact that the new basic feasible solution, the one for $B'$, can be written
in terms of the old one, and the nonnegativity conditions for its basic
variables are exactly the conditions used for choosing the leaving
variable.

In practically the same way one can show the part of the lemma
dealing with unbounded linear programs. We omit further details.


A geometric view. As we saw in Section 4.4, basic feasible solutions 
are vertices of the polyhedron of feasible solutions. It is not hard to 
verify that a pivot step of the simplex method corresponds to a move 
from one vertex to another along an edge of the polyhedron (where an 
edge is a 1-dimensional face, i.e., a segment connecting the considered 
vertices; see Section 4.4). 

Degenerate pivot steps are an exception, where we stay at the 
same vertex and only the feasible basis changes. A vertex of an n- 
dimensional convex polyhedron is generally determined by n of the 
bounding hyperplanes (think of a 3-dimensional cube, say). Degener- 
acy can occur only if we have more than n of the bounding hyperplanes 
meeting at a vertex (this happens for the 3-dimensional regular octa- 
hedron, for example). 


Organization of the computations. Whenever we find a new feasible 
basis as above, we could compute the new simplex tableau according to the 
formulas from Lemma 5.5.1. But this is never done since it is inefficient. 

For hand calculation the new simplex tableau is computed from the old
one. We have already illustrated one possible approach in the examples. We
take the equation of the old tableau with the leaving variable $x_u$ on the left,
and in this equation we carry the entering variable $x_v$ over to the left and
$x_u$ to the right. The modified equation becomes the equation for $x_v$ in the
new tableau. The right-hand side is then substituted for $x_v$ into all of the
other equations, including the one for $z$ in the last row. This finishes the
construction of the new tableau.


In computer implementations of the simplex method, the simplex
tableau is typically not computed in full. Rather, only the basic components
of the basic feasible solution, i.e., the vector $p = A_B^{-1} b$, and
the matrix $A_B^{-1}$ are maintained. The latter allows for a fast computation
of other entries of the simplex tableau when they are needed. (Let
us note that for the optimality test and for selecting the entering variable
we need only the last row, and for selecting the leaving variable
we need only $p$ and the column of the entering variable.) With respect
to efficiency and numerical accuracy, the explicit inverse $A_B^{-1}$ is not
the best choice, and in practice, it is often represented by an (approximate)
LU-factorization of the matrix $A_B$, or by other devices that can
easily be updated during a pivot step of the simplex method. Since an
efficient implementation of the simplex method is not among our main
concerns, we will not describe how these things are actually done.

This computational approach is called the revised simplex method.
For $m$ considerably smaller than $n$ it is usually much more efficient
than maintaining all of the simplex tableau. In particular, $O(m^2)$
arithmetic operations per pivot step are sufficient for maintaining an
LU-factorization of $A_B$, as opposed to about $mn$ operations required
for maintaining the simplex tableau.


Computing an initial feasible basis. If the given linear program has no
“obvious” feasible basis, we look for an initial feasible basis by the procedure
indicated in Section 5.4. For a linear program in the usual equational form

    maximize $c^T x$ subject to $Ax = b$ and $x \ge 0$

we first arrange for $b \ge 0$: We multiply the equations with $b_i < 0$ by $-1$.
Then we introduce $m$ new variables $x_{n+1}$ through $x_{n+m}$, and we solve the
auxiliary linear program

    maximize     $-(x_{n+1} + x_{n+2} + \cdots + x_{n+m})$
    subject to   $\bar{A} \bar{x} = b$
                 $\bar{x} \ge 0,$

where $\bar{x} = (x_1, \ldots, x_{n+m})$ is the vector of all variables including the new
ones, and $\bar{A} = (A \mid I_m)$ is obtained from $A$ by appending the $m \times m$ identity
matrix to the right. The original linear program is feasible if and only if every
optimal solution of the auxiliary linear program satisfies
$x_{n+1} = x_{n+2} = \cdots = x_{n+m} = 0$. Indeed, it is clear that an optimal solution
of the auxiliary linear program with $x_{n+1} = x_{n+2} = \cdots = x_{n+m} = 0$ yields a
feasible solution of the original linear program. Conversely, any feasible solution
of the original linear program provides a feasible solution of the auxiliary linear
program that has the objective function equal to 0 and is thus optimal.

The auxiliary linear program can be solved by the simplex method directly,
since the new variables $x_{n+1}$ through $x_{n+m}$ constitute an initial feasible
basis. In this way we obtain some optimal solution. If it doesn’t satisfy
$x_{n+1} = x_{n+2} = \cdots = x_{n+m} = 0$, we are done: the original linear program is
infeasible.

Let us assume that the optimal solution of the auxiliary linear program
has $x_{n+1} = x_{n+2} = \cdots = x_{n+m} = 0$. The simplex method always returns a
basic feasible solution. If none of the new variables $x_{n+1}$ through $x_{n+m}$ are
in the basis for the returned optimal solution, then such a basis is a
feasible basis for the original linear program, too, and it allows us to start
the simplex method.

In some degenerate cases it may happen that the basis returned by the
simplex method for the auxiliary linear program contains some of the variables
$x_{n+1}, \ldots, x_{n+m}$, and such a basis cannot directly be used for the original
linear program. But this is a cosmetic problem only: From the returned optimal
solution one can get a feasible basis for the original linear program by
simple linear algebra. Namely, the optimal solution has at most $m$ nonzero
components, and their columns in the matrix $A$ are linearly independent.
If these columns are fewer than $m$, we can add more linearly independent
columns and thus get a basis; see the proof of Lemma 4.2.1.
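
The construction of the auxiliary linear program is purely mechanical, as the following Python sketch suggests. It is an illustration under assumptions made here: the function name `auxiliary_program`, the list-of-lists matrix format, and the 0-based column indices are not taken from the text.

```python
def auxiliary_program(A, b):
    """Build the auxiliary linear program for Ax = b, x >= 0.

    Returns (A_bar, b_bar, c_bar, basis): the matrix (A | I_m) after the
    rows with negative right-hand side have been multiplied by -1, the
    nonnegative right-hand side, the objective vector of
    -(x_{n+1} + ... + x_{n+m}), and the initial feasible basis formed by
    the m new variables (0-based column indices).
    """
    m, n = len(A), len(A[0])
    A_bar, b_bar = [], []
    for i in range(m):
        sign = -1 if b[i] < 0 else 1          # arrange for b >= 0
        unit = [1 if k == i else 0 for k in range(m)]
        A_bar.append([sign * v for v in A[i]] + unit)
        b_bar.append(sign * b[i])
    c_bar = [0] * n + [-1] * m
    basis = list(range(n, n + m))
    return A_bar, b_bar, c_bar, basis

# The system of Section 5.4 with its second equation multiplied by -1,
# so that the sign change has something to do.
A_bar, b_bar, c_bar, basis = auxiliary_program([[1, 3, 1], [0, -2, -1]],
                                               [4, -2])
print(A_bar, b_bar, c_bar, basis)
```

The returned basis gives the basic feasible solution in which the new variables take the values of the (nonnegative) right-hand sides and all original variables are 0.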


5.7 Pivot Rules 


A pivot rule is a rule for selecting the entering variable if there are several 
possibilities, which is usually the case. Sometimes there may also be more 
than one possibility for choosing the leaving variable, and some pivot rules 
specify this choice as well, but this part is typically not so important. 

The number of pivot steps needed for solving a linear program depends 
substantially on the pivot rule. (See the example in Section 5.1.) The problem 
is, of course, that we do not know in advance which choices will be good in 
the long run. 

Here we list some of the common pivot rules. By an “improving variable” 
we mean any nonbasic variable with a positive coefficient in the z-row of the 
simplex tableau, in other words, a candidate for the entering variable. 


LARGEST COEFFICIENT. Choose an improving variable with the largest co- 
efficient in the row of the objective function z. This is the original rule, sug- 
gested by Dantzig, that maximizes the improvement of z per unit increase of 
the entering variable. 


LARGEST INCREASE. Choose an improving variable that leads to the largest 
absolute improvement in z. This rule is computationally more expensive than 
the LARGEST COEFFICIENT rule, but it locally maximizes the progress. 


STEEPEST EDGE. Choose an improving variable whose entering into the
basis moves the current basic feasible solution in a direction closest to the
direction of the vector $c$. Written by a formula, the ratio

    $\frac{c^T (x_{\text{new}} - x_{\text{old}})}{\| x_{\text{new}} - x_{\text{old}} \|}$

should be maximized, where $x_{\text{old}}$ is the basic feasible solution for the current
simplex tableau and $x_{\text{new}}$ is the basic feasible solution for the tableau that
would be obtained by entering the considered improving variable into the
basis. (We recall that $\|v\| = (v_1^2 + v_2^2 + \cdots + v_n^2)^{1/2} = \sqrt{v^T v}$ denotes the
Euclidean length of the vector $v$, and the expression $u^T v / (\|u\| \cdot \|v\|)$ is the
cosine of the angle of the vectors $u$ and $v$.)

The STEEPEST EDGE rule is a champion among pivot rules in practice.
According to extensive computational studies it is usually faster than all
other pivot rules described here and many others. An efficient approximate
implementation of this rule is discussed in the glossary under the heading
“Devex.”


BLAND’S RULE. Choose the improving variable with the smallest index, and 
if there are several possibilities for the leaving variable, also take the one with 
the smallest index. BLAND’S RULE is theoretically very significant since it 
prevents cycling, as we will discuss in Section 5.8. 


RANDOM EDGE. Select the entering variable uniformly at random among all 
improving variables. This is the simplest example of a randomized pivot rule, 
where the choice of the entering variable uses random numbers in some way. 
Randomized rules are also very important theoretically, since they lead to the 
current best provable bounds for the number of pivot steps of the simplex 
method. 
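
Three of the rules above need only the z-row of the tableau, and they can be sketched in a few lines of Python. The names below (`improving`, `largest_coefficient`, `bland`, `random_edge`) and the representation of the vector $r$ as a dictionary from variable index to coefficient are assumptions made for this illustration. LARGEST INCREASE and STEEPEST EDGE are omitted, since they also need the columns of the tableau.

```python
import random

def improving(r):
    """Sorted indices of the nonbasic variables with positive coefficient."""
    return sorted(j for j, coef in r.items() if coef > 0)

def largest_coefficient(r):
    """LARGEST COEFFICIENT (Dantzig): maximize the coefficient in the z-row."""
    return max(improving(r), key=lambda j: r[j])

def bland(r):
    """Entering part of BLAND'S RULE: smallest index among the candidates."""
    return min(improving(r))

def random_edge(r, rng=random):
    """RANDOM EDGE: a uniformly random improving variable."""
    return rng.choice(improving(r))

# z-row of the auxiliary program of Section 5.4: z = -6 + x1 + 5x2 + 2x3.
r = {1: 1, 2: 5, 3: 2}
print(largest_coefficient(r), bland(r), random_edge(r))
```

All three functions assume that at least one improving variable exists; a caller would test the optimality criterion $r \le 0$ first.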


5.8 The Struggle Against Cycling 


As we have already mentioned, it may happen that for some linear programs 
the simplex method cycles (and theoretically this is the only possibility of 
how it may fail). Such a situation is encountered very rarely in practice, if at 
all, and thus many implementations simply ignore the possibility of cycling. 

There are several ways that provably avoid cycling. One of them is the 
already mentioned BLAND’S RULE: We prove below that the simplex method 
never cycles if Bland’s rule is applied consistently. Unfortunately, regarding 
efficiency, Bland’s rule is one of the slowest pivot rules and it is almost never 
used in practice. 


Another possibility can be found in the literature under the head- 
ing lexicographic rule, and here we only sketch it. 

Cycling can occur only for degenerate linear programs. Degeneracy
may lead to ties in the choice of the leaving variable. The lexicographic
method breaks these ties as follows. Suppose that we have a set $S$ of
row indices such that for all $\alpha \in S$,

    $q_{\alpha\beta} < 0$ and $-\frac{p_\alpha}{q_{\alpha\beta}} = \min\left\{ -\frac{p_i}{q_{i\beta}} : q_{i\beta} < 0,\ i = 1, 2, \ldots, m \right\}$.


In other words, all indices in $S$ are candidates for the leaving variable.
We then choose the index $\alpha \in S$ for which the vector

    $\left( \frac{q_{\alpha 1}}{q_{\alpha\beta}}, \ldots, \frac{q_{\alpha(n-m)}}{q_{\alpha\beta}} \right)$

is the smallest in the lexicographic ordering. (We recall that a vector
$x \in \mathbb{R}^k$ is lexicographically smaller than a vector $y \in \mathbb{R}^k$ if $x_1 < y_1$,
or if $x_1 = y_1$ and $x_2 < y_2$, etc., in general, if there is an index $j \le k$
such that $x_1 = y_1, \ldots, x_{j-1} = y_{j-1}$ and $x_j < y_j$.) Since the matrix $A$
has rank $m$, it can be checked that any two of those vectors indeed
differ at some index, and so we can resolve ties between any set $S$ of
rows. The chosen row index determines the leaving variable.

It can be shown that under the lexicographic rule, cycling is im- 
possible. In very degenerate cases the lexicographic rule can be quite 
costly, since it may have to compute many components of the afore- 
mentioned vectors before the ties can eventually be broken. 

Geometrically, the lexicographic rule has the following interpretation.
For linear programs in equational form, degeneracy means that
the set $F$ of solutions of the system $Ax = b$ contains a point with more
than $n - m$ zero components, and thus it is not in general position
with respect to coordinate axes. The lexicographic rule has essentially
the same effect as a well-chosen perturbation of the set $F$, achieved by
changing the vector $b$ a little. This brings $F$ into “general position”
and therefore resolves all ties, while the optimal solution changes only
by very little. The lexicographic rule simulates the effects of a suitable
“infinitesimal” perturbation.

Now we return to Bland’s rule. 


5.8.1 Theorem. The simplex method with Bland’s pivot rule (the entering 
variable is the one with the smallest index among the eligible variables, and 
similarly for the leaving variable) is always finite; i.e., cycling is impossible. 


This is a basic result in the theory of linear programming (the duality 
theorem is an easy consequence, for example). Unfortunately, the proof is 
somewhat demanding. Its plot is simple, though: Assuming that there is a 
cycle, we get a contradiction in the form of an auxiliary linear program that 
has an optimal solution and is unbounded at the same time. 


Proof. We assume that there is a cycle, and we let the set $F$ consist of the
indices of all variables that enter (and therefore also leave) the basis at least
once during the cycle. We call these the fickle variables. First we verify a
general claim about cycling of the simplex method, valid for any pivot rule.


Claim. All bases encountered in the cycle yield the same basic feasible solu- 
tion, and all the fickle variables are 0 in it. 


Proof of the claim. Since the objective function never decreases, it has to 
stay constant along the cycle. 

Let $B$ be a feasible basis encountered along the cycle, let $N = \{1,2,\ldots,n\} \setminus B$
as usual, and let $B' = (B \setminus \{u\}) \cup \{v\}$ be the next basis. The only one
among the nonbasic variables that may possibly change value in the pivot
step from $B$ to $B'$ is the entering variable $x_v$; all others remain nonbasic
and thus 0. By the rule for selecting the entering variable, the coefficient of
$x_v$ in the $z$-row of the tableau $T(B)$ (i.e., in the vector $r$) is strictly positive.
Since the objective function is given by $z = z_0 + r^T x_N$, we see that
if $x_v$ became strictly positive, the objective function would increase. Hence
the basic feasible solutions corresponding to $B$ and $B'$, respectively, agree in
all components in $N$. Since these components determine the remaining ones
uniquely (Proposition 4.2.2), the basic feasible solution does not change at
all.

Finally, since every fickle variable is nonbasic at least once during the 
cycle, it has to be 0 all the time. The claim is proved. 


The first trick in the proof of Theorem 5.8.1 is to consider the largest
index $v$ in the set $F$. Let $B$ be a basis in the cycle just before $x_v$ enters, and
$B'$ another basis just before $x_v$ leaves (and $x_u$ enters, say). Let $p, Q, r, z_0$
be the parameters of the simplex tableau $T(B)$, and let $p', Q', r', z_0'$ be the
parameters of $T(B')$. (We remark that neither $B$ nor $B'$ has to be determined
uniquely.)

Next, we use Bland’s rule to infer some properties of the tableaus $T(B)$
and $T(B')$. First we focus on the situation at $B$. As in Section 5.6, we write
$B$ and $N = \{1,2,\ldots,n\} \setminus B$ as ordered sets: $B = \{k_1, k_2, \ldots, k_m\}$,
$k_1 < k_2 < \cdots < k_m$, and $N = \{\ell_1, \ell_2, \ldots, \ell_{n-m}\}$, $\ell_1 < \ell_2 < \cdots < \ell_{n-m}$. Since
we have chosen $v$ as the largest index in $F$, and Bland’s rule requires $v$ to
be the smallest index of a candidate for entering the basis, no other fickle
variable is a candidate at this point. Thus all fickle variables except for $x_v$
have nonpositive coefficients in the $z$-row of $T(B)$. Expressed formally, if $\beta$ is
the index such that $v = \ell_\beta$, we have

    $r_\beta > 0$ and $r_j \le 0$ for all $j$ such that $\ell_j \in F \setminus \{v\}$.    (5.4)


Second, we consider the tableau $T(B')$. We write $B' = \{k'_1, k'_2, \ldots, k'_m\}$,
$N' = \{1,2,\ldots,n\} \setminus B' = \{\ell'_1, \ell'_2, \ldots, \ell'_{n-m}\}$, we let $\alpha'$ be the index of the
leaving variable $x_v$ in $B'$, i.e., the one with $k'_{\alpha'} = v$, and we let $\beta'$ be the
index of the entering variable $x_u$ in $N'$, i.e., the one with $\ell'_{\beta'} = u$. By the
same logic as above, $x_v$ is the only candidate for leaving the basis among all
the basic fickle variables in $T(B')$. Recalling the criterion (5.3) for leaving the
basis, we get that $i = \alpha'$ is the only $i$ with $k'_i \in F$ and $q'_{i\beta'} < 0$ that minimizes
the ratio $-p'_i / q'_{i\beta'}$. Since $p'$ specifies the values of the basic variables and all
fickle variables remain 0 during the cycle, we have $p'_i = 0$ for all $i$ with $k'_i \in F$.
Consequently,

    $q'_{\alpha'\beta'} < 0$ and $q'_{i\beta'} \ge 0$ for all $i$ such that $k'_i \in F \setminus \{v\}$.    (5.5)


The idea is now to construct an auxiliary linear program for which (5.4)
proves that it has an optimal solution, while (5.5) shows that it is unbounded.
This is a clear contradiction, which rules out the initial assumption, the
existence of a cycle.

The auxiliary linear program is the following:

    Maximize     $c^T x$
    subject to   $Ax = b$
                 $x_{F \setminus \{v\}} \ge 0$
                 $x_v \le 0$
                 $x_{N \setminus F} = 0.$

We want to stress that here the variables $x_{B \setminus F}$ may assume any signs.


Optimality of the auxiliary linear program. We let $\tilde{x}$ be the basic feasible
solution of our original linear program associated with the basis $B$. Since
$\tilde{x}_N = 0$ and $\tilde{x}_F = 0$ (by the claim), $\tilde{x}$ is feasible for the auxiliary program.
Moreover, for every $x$ satisfying $Ax = b$ the value of the objective function
can be expressed as

    $c^T x = z = z_0 + r^T x_N.$

For all feasible solutions $x$ of the auxiliary linear program, we have

    $x_{\ell_j} \begin{cases} \ge 0 & \text{if } \ell_j \in F \setminus \{v\} \\ \le 0 & \text{if } \ell_j = \ell_\beta = v, \end{cases}$

and so (5.4) implies

    $r_j x_{\ell_j} \le 0$ for all $j$ such that $\ell_j \in F$.

Together with $x_{N \setminus F} = 0$, we get $r^T x_N \le 0$, and hence $z \le z_0$ for all feasible
solutions of the auxiliary linear program. It follows that $\tilde{x}$ is an optimal
solution of the auxiliary linear program.


Unboundedness of the auxiliary linear program. By the claim at the beginning
of the proof, $\tilde{x}$ is also the basic feasible solution of our original linear program
associated with the basis $B'$. For all solutions $x$ of $Ax = b$ we have

    $x_{B'} = p' + Q' x_{N'}$.    (5.6)

Let us now change $\tilde{x}_{N'}$ by letting $\tilde{x}_u$ grow from its current value 0 to some
value $t > 0$. Using (5.6), this determines a new solution $\tilde{x}(t)$ of $Ax = b$; we
will show that for all $t > 0$, this solution is feasible for the auxiliary problem,
but that the objective function value $c^T \tilde{x}(t)$ tends to infinity as $t \to \infty$. Here
are the details.

    $\tilde{x}_{\ell'_j}(t) := \begin{cases} 0 & \text{if } \ell'_j \in N' \setminus \{u\} \\ t & \text{if } \ell'_j = \ell'_{\beta'} = u. \end{cases}$

With $\tilde{x}_F = 0$ and $t > 0$, (5.6) and (5.5) together show that

    $\tilde{x}_{k'_i}(t) = \tilde{x}_{k'_i} + t q'_{i\beta'} \begin{cases} \ge 0 & \text{if } k'_i \in F \setminus \{v\} \\ < 0 & \text{if } k'_i = k'_{\alpha'} = v. \end{cases}$

In particular, $\tilde{x}(t)$ is again feasible for the auxiliary linear program.
Since the variable $x_u = x_{\ell'_{\beta'}}$ was a candidate for entering the basis $B'$, we
know that $r'_{\beta'} > 0$, and hence

    $c^T \tilde{x}(t) = z'_0 + r'^T \tilde{x}_{N'}(t) = z'_0 + t r'_{\beta'} \to \infty$ for $t \to \infty$.

This means that the auxiliary linear program is unbounded.


5.9 Efficiency of the Simplex Method 


In practice, the simplex method performs very satisfactorily even for large lin- 
ear programs. Computational experiments indicate that for linear programs 
in equational form with m equations it typically reaches an optimal solution 
in something between 2m and 3m pivot steps. 

It was thus a great surprise when Klee and Minty constructed a linear
program with $n$ nonnegative variables and $n$ inequalities for which the simplex
method with Dantzig’s original pivot rule (LARGEST COEFFICIENT) needs
exponentially many pivot steps, namely $2^n - 1$!


The set of feasible solutions is an ingeniously deformed $n$-dimensional
cube, called the Klee-Minty cube, constructed in such a way that
the simplex method passes through all of its vertices. It is not hard to
see that there is a deformed $n$-dimensional cube with an $x_n$-increasing
path, say, through all vertices. Instead of a formal description we
illustrate such a construction by pictures for dimensions 2 and 3:

[Figure: deformed cubes in dimensions 2 and 3, each inscribed in an ordinary cube, with a path through all vertices marked by a thick line.]

The deformed cube is inscribed in an ordinary cube in order to better
convey the shape. With some pivot rules, the simplex method may
traverse the path marked with a thick line. The particular deformed
cube shown in the picture won’t fool Dantzig’s rule, for which the original
example of Klee and Minty was constructed, though. A deformed
cube that does fool Dantzig’s rule looks more bizarre:

[Figure: a deformed 3-dimensional cube that fools Dantzig’s rule.]

The direction of the objective function is drawn vertically. The corresponding
linear program with $n = 3$ variables is simple:


    Maximize     $9x_1 + 3x_2 + x_3$
    subject to   $x_1 \le 1$
                 $6x_1 + x_2 \le 9$
                 $18x_1 + 6x_2 + x_3 \le 81$
                 $x_1, x_2, x_3 \ge 0.$


It is instructive to see how, after the standard conversion to equational 
form, this linear program forces Dantzig’s rule to go through all feasi- 
ble bases. 
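
The behavior of Dantzig’s rule on this linear program can be observed with a small program. The following Python sketch is an illustration under assumptions made here: the function name `dantzig_simplex` is hypothetical, ties are broken by the smallest index, and the rows store $A_B^{-1} A$ and $A_B^{-1} b$ (the classical tableau layout), so the signs of the stored entries are opposite to those of $Q$ in Section 5.5.

```python
from fractions import Fraction

def dantzig_simplex(A, b, c):
    """Maximize c^T x subject to Ax <= b, x >= 0, assuming b >= 0.

    Uses the LARGEST COEFFICIENT rule and returns the pair
    (number of pivot steps, optimal value).
    """
    m, n = len(A), len(c)
    # Row i stores [A | I | b]: the slack variables turn Ax <= b into
    # equations and form the initial feasible basis.
    rows = [[Fraction(v) for v in A[i]]
            + [Fraction(int(i == k)) for k in range(m)]
            + [Fraction(b[i])] for i in range(m)]
    r = [Fraction(v) for v in c] + [Fraction(0)] * m   # coefficients in the z-row
    z0 = Fraction(0)
    pivots = 0
    while True:
        j = max(range(n + m), key=lambda col: r[col])  # entering column
        if r[j] <= 0:
            return pivots, z0                          # optimality criterion
        # The rows hold -Q, so "q < 0" in (5.3) reads "entry > 0" here.
        cand = [i for i in range(m) if rows[i][j] > 0]
        if not cand:
            raise ValueError("linear program is unbounded")
        i = min(cand, key=lambda k: rows[k][-1] / rows[k][j])  # ratio test
        rows[i] = [v / rows[i][j] for v in rows[i]]
        for k in range(m):
            if k != i:
                f = rows[k][j]
                rows[k] = [v - f * w for v, w in zip(rows[k], rows[i])]
        f = r[j]
        z0 += f * rows[i][-1]
        r = [v - f * w for v, w in zip(r, rows[i][:-1])]
        pivots += 1

steps, value = dantzig_simplex([[1, 0, 0], [6, 1, 0], [18, 6, 1]],
                               [1, 9, 81], [9, 3, 1])
print(steps, value)   # 7 pivot steps, optimal value 81
```

The count 7 equals $2^3 - 1$, so after the initial basis every pivot step of this run reaches a new feasible basis until all 8 have been visited.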


Later on, very slow examples of a similar type were discovered for many 
other pivot rules, among them all the rules mentioned above. Many people 
have tried to design a pivot rule and prove that the number of pivot steps is 
always bounded by some polynomial function of m and n, but nobody has 
succeeded so far. The best known bound has been proved for the following 
simple randomized pivot rule: Choose a random ordering of the variables at 
the beginning of the computation (in other words, randomly permute the 
indices of the variables in the input linear program); then use Bland’s rule 
for choosing the entering variable, and the lexicographic method for choosing 
the leaving variable. For every linear program with at most $n$ variables and
at most $n$ constraints, the expected number of pivot steps is bounded by
$e^{C\sqrt{n \ln n}}$, where $C$ is a (not too large) constant. (Here the expectation means
the arithmetic average over all possible orderings of the variables.) This bound
is considerably better than $2^n$, say, but much worse than a polynomial bound.


This algorithm was found independently and almost at the same
time by Kalai and by Matoušek, Sharir, and Welzl. For a recent treatment
in a somewhat broader context see

B. Gärtner and E. Welzl: Explicit and implicit enforcing - randomized
optimization, in Lectures of the Graduate Program
Computational Discrete Mathematics, Lecture Notes in Computer
Science 2122 (2001), Springer, Berlin etc., pages 26-49.


A very good bound is not known even for the cleverest possible 
pivot rule, let us call it the clairvoyant’s rule, that would always se- 
lect the shortest possible sequence of pivot steps leading to an optimal 
solution. The Hirsch conjecture, one of the famous open problems in
mathematics, claims that the clairvoyant’s rule always reaches optimum
in $O(n)$ pivot steps. But the best result proved so far gives only
the bound of $n^{1+\ln n}$, due to

G. Kalai and D. Kleitman: Quasi-polynomial bounds for the
diameter of graphs of polyhedra, Bull. Amer. Math. Soc.
26(1992), 315-316.

This is better than $e^{C\sqrt{n \ln n}}$, but still worse than any polynomial function
of $n$, and it doesn’t provide a real pivot rule since nobody knows
how to simulate clairvoyant’s decisions by an efficient algorithm.

Here is an approach that looks promising and has been tried more 
recently, although without a clear success so far. One tries to modify 
the given linear program in such a way that polynomiality of a suitable 
pivot rule for the modified linear program would be easier to prove, and 
of course, so that an optimal solution of the original linear program 
could easily be derived from an optimal solution of the modified linear 
program. 

In spite of the Klee-Minty cube and similar artificial examples,
the simplex method is being used successfully. Remarkable theoretical
results indicate that these willful examples are rare indeed. For
instance, it is known that if a linear program in equational form is
generated in a suitable (precisely defined) way at random, then the
number of pivot steps is of order at most $m^2$ with high probability.
More recent results, in the general framework of the so-called smoothed 
complexity, claim that if we take an arbitrary linear program and then 
we change its coefficients by small random amounts, then the simplex 
method with a certain pivot rule reaches the optimum of the resulting 
linear program by polynomially many steps with high probability (a 
concrete bound on the polynomial depends on a precise specification 
of the “small random amounts” of change). The first theorem of this 
kind is due to Spielman and Teng, and for recent progress see 


R. Vershynin: Beyond Hirsch conjecture: Walks on ran- 
dom polytopes and the smoothed complexity of the simplex 
method, preprint, 2006. 


An exact formulation of these results requires a number of rather 
technical notions that we do not want to introduce here, and so we 
omit it. 


5.10 Summary 


Let us review the simplex method once again. 
Algorithm SIMPLEX METHOD

1. Convert the input linear program to equational form

   maximize $c^T x$ subject to $Ax = b$ and $x \ge 0$

   with $n$ variables and $m$ equations, where $A$ has rank $m$ (see Section 4.1).

2. If no feasible basis is available, arrange for $b \ge 0$, and solve the following
   auxiliary linear program by the simplex method:

   maximize   $-(x_{n+1} + x_{n+2} + \cdots + x_{n+m})$
   subject to $\bar{A}\bar{x} = b$
              $\bar{x} \ge 0$,

   where $x_{n+1},\ldots,x_{n+m}$ are new variables, $\bar{x} = (x_1,\ldots,x_{n+m})$, and
   $\bar{A} = (A \mid I_m)$. If the optimal value of the objective function comes out negative,
   the original linear program is infeasible; stop. Otherwise, the first $n$
   components of the optimal solution form a basic feasible solution of the
   original linear program.

3. For a feasible basis $B \subseteq \{1,2,\ldots,n\}$ compute the simplex tableau $T(B)$,
   of the form

   $x_B = p + Q x_N$
   $z = z_0 + r^T x_N$

4. If $r \le 0$ in the current simplex tableau, return an optimal solution
   ($p$ specifies the basic components, while the nonbasic components are
   0); stop.

5. Otherwise, select an entering variable $x_v$ whose coefficient in the vector
   $r$ is positive. If there are several possibilities, use some pivot rule.

6. If the column of the entering variable $x_v$ in the simplex tableau is
   nonnegative, the linear program is unbounded; stop.

7. Otherwise, select a leaving variable $x_u$. Consider all rows of the simplex
   tableau where the coefficient of $x_v$ is negative, and in each such row divide
   the component of the vector $p$ by that coefficient and change sign. The
   row of the leaving variable is one in which this ratio is minimal. If there
   are several possibilities, decide by a pivot rule, or arbitrarily if the pivot
   rule doesn’t specify how to break ties in this case.

8. Replace the current feasible basis $B$ by the new feasible basis
   $(B \setminus \{u\}) \cup \{v\}$. Update the simplex tableau so that it corresponds to this new basis.
   Go to Step 4.
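
The steps above can be sketched in code. The following is a minimal illustration, not a production implementation: it assumes the input is already given as maximize $c^T x$ subject to $Ax \le b$, $x \ge 0$ with $b \ge 0$, so that the slack variables form an initial feasible basis and Step 2 can be skipped. It works with exact rational arithmetic and uses Bland's smallest-index rule for both the entering and the leaving variable. The function name `simplex` and its return format are choices made for this sketch.

```python
from fractions import Fraction

def simplex(c, A, b):
    """Maximize c^T x subject to Ax <= b, x >= 0, assuming b >= 0."""
    m, n = len(A), len(c)
    nv = n + m                      # original variables plus slack variables
    # Tableau row of a basic variable x_i:
    #   x_i = row[nv] + sum over nonbasic j of row[j] * x_j
    rows = {}
    for i in range(m):
        row = [Fraction(0)] * (nv + 1)
        for j in range(n):
            row[j] = -Fraction(A[i][j])
        row[nv] = Fraction(b[i])
        rows[n + i] = row           # slack: x_{n+i} = b_i - sum_j a_ij x_j
    # Objective row: z = z[nv] + sum over nonbasic j of z[j] * x_j
    z = [Fraction(cj) for cj in c] + [Fraction(0)] * (m + 1)
    while True:
        # Steps 4-5: smallest-index nonbasic variable with positive coefficient.
        enter = next((j for j in range(nv) if j not in rows and z[j] > 0), None)
        if enter is None:
            x = [rows[j][nv] if j in rows else Fraction(0) for j in range(n)]
            return "optimal", z[nv], x
        # Steps 6-7: ratio test over rows with a negative coefficient.
        cand = [(-rows[i][nv] / rows[i][enter], i)
                for i in rows if rows[i][enter] < 0]
        if not cand:
            return "unbounded", None, None
        _, leave = min(cand)        # ties are broken by the smallest index
        # Step 8: solve the leaving row for the entering variable ...
        old = rows.pop(leave)
        a = old[enter]
        new = [-coef / a for coef in old]
        new[leave] = 1 / a
        new[enter] = Fraction(0)
        rows[enter] = new
        # ... and substitute it into all other rows and the objective row.
        for r in list(rows.values()) + [z]:
            if r is new:
                continue
            k = r[enter]
            if k != 0:
                for t in range(nv + 1):
                    r[t] += k * new[t]
                r[enter] = Fraction(0)

print(simplex([2, 3], [[4, 8], [2, 1], [3, 2]], [12, 3, 4]))
```

The call at the end runs the sketch on a small example with three constraints and two variables. Because the arithmetic is exact, the returned optimum value and solution are rational numbers rather than floating-point approximations.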


This is all we wanted to say about the simplex method here. May your 
pivot steps lead you straight to the optimum and never cycle! 


6. Duality of Linear Programming 


6.1 The Duality Theorem 


Here we formulate arguably the most important theoretical result about lin- 
ear programs. 
Let us consider the linear program 


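$$
\begin{array}{lrcl}
\text{maximize}   & 2x_1 + 3x_2 & & \\
\text{subject to} & 4x_1 + 8x_2 & \le & 12 \\
                  & 2x_1 + x_2  & \le & 3 \\
                  & 3x_1 + 2x_2 & \le & 4 \\
                  & x_1, x_2    & \ge & 0.
\end{array}
\qquad (6.1)
$$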


Without computing the optimum, we can immediately infer from the first 
inequality and from the nonnegativity constraints that the maximum of the 
objective function is not larger than 12, because for nonnegative $x_1$ and $x_2$
we have

$$2x_1 + 3x_2 \le 4x_1 + 8x_2 \le 12.$$

We obtain a better upper bound if we first divide the first inequality by two:

$$2x_1 + 3x_2 \le 2x_1 + 4x_2 \le 6.$$

An even better bound results if we add the first two inequalities together and
divide by three, which leads to the inequality

$$2x_1 + 3x_2 = \tfrac{1}{3}(4x_1 + 8x_2 + 2x_1 + x_2) \le \tfrac{1}{3}(12 + 3) = 5,$$


and hence the objective function cannot be larger than 5. 

How good an upper bound can we get in this way? And what does “in 
this way” mean? Let us begin with the latter question: From the constraints, 
we are trying to derive an inequality of the form 


$$d_1 x_1 + d_2 x_2 \le h,$$

where $d_1 \ge 2$, $d_2 \ge 3$, and $h$ is as small as possible. Then we can claim that
for all $x_1, x_2 \ge 0$ we have

$$2x_1 + 3x_2 \le d_1 x_1 + d_2 x_2 \le h,$$


and therefore, h is an upper bound on the maximum of the objective function. 
How can we derive such inequalities? We combine the three inequalities in the 
linear program with some nonnegative coefficients $y_1, y_2, y_3$ (nonnegativity is
needed so that the direction of inequality is not reversed). We obtain

$$(4y_1 + 2y_2 + 3y_3)x_1 + (8y_1 + y_2 + 2y_3)x_2 \le 12y_1 + 3y_2 + 4y_3,$$

and thus $d_1 = 4y_1 + 2y_2 + 3y_3$, $d_2 = 8y_1 + y_2 + 2y_3$, and $h = 12y_1 + 3y_2 + 4y_3$.

How do we choose the best coefficients $y_1, y_2, y_3$? We must ensure that
$d_1 \ge 2$ and $d_2 \ge 3$, and we want $h$ to be as small as possible under these
constraints. This is again a linear program:

$$
\begin{array}{lrcl}
\text{minimize}   & 12y_1 + 3y_2 + 4y_3 & & \\
\text{subject to} & 4y_1 + 2y_2 + 3y_3  & \ge & 2 \\
                  & 8y_1 + y_2 + 2y_3   & \ge & 3 \\
                  & y_1, y_2, y_3       & \ge & 0.
\end{array}
$$


It is called the linear program dual to the linear program (6.1) we started 
with. The dual linear program “guards” the original linear program from 
above, in the sense that every feasible solution $(y_1, y_2, y_3)$ of the dual linear
program provides an upper bound on the maximum of the objective function 
in (6.1). 

How well does it guard? Perfectly! The optimal solution of the dual linear
program is $y = (\tfrac{5}{16}, 0, \tfrac{1}{4})$ with objective function equal to 4.75, and this
is also the optimal value of the linear program (6.1), which is attained for
$x = (\tfrac{1}{2}, \tfrac{5}{4})$.
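
The successive bounds 12, 6, 5, and 4.75 can be checked mechanically. The sketch below is an illustration added here, with helper names `combine` and `is_upper_bound` chosen for the purpose: it forms the nonnegative combination of the three constraints of (6.1) for given multipliers $y_1, y_2, y_3$ and tests whether the resulting left-hand side dominates the objective $2x_1 + 3x_2$.

```python
from fractions import Fraction as F

# Data of the linear program (6.1): maximize c^T x subject to Ax <= b, x >= 0.
A = [[4, 8], [2, 1], [3, 2]]
b = [12, 3, 4]
c = [2, 3]

def combine(y):
    """Combine the three inequalities with multipliers y >= 0.

    Returns (d, h), meaning that d[0]*x1 + d[1]*x2 <= h is a valid inequality."""
    assert all(yi >= 0 for yi in y), "multipliers must be nonnegative"
    d = [sum(y[i] * A[i][j] for i in range(3)) for j in range(2)]
    h = sum(y[i] * b[i] for i in range(3))
    return d, h

def is_upper_bound(y):
    """True if the combination dominates the objective, that is, d >= c."""
    d, _ = combine(y)
    return all(dj >= cj for dj, cj in zip(d, c))

multipliers = [
    [1, 0, 0],                  # first inequality as it stands
    [F(1, 2), 0, 0],            # first inequality divided by two
    [F(1, 3), F(1, 3), 0],      # first two inequalities added, divided by three
    [F(5, 16), 0, F(1, 4)],     # the optimal dual solution
]
for y in multipliers:
    d, h = combine(y)
    print("y =", y, "-> bound", h, "valid:", is_upper_bound(y))

# Objective value at the primal point x = (1/2, 5/4).
x = [F(1, 2), F(5, 4)]
print("objective at x:", sum(cj * xj for cj, xj in zip(c, x)))
```

The last multiplier vector yields a bound that coincides with the objective value at $x = (\tfrac12, \tfrac54)$, which is exactly the situation described by the duality theorem.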

The duality theorem asserts that the dual linear program always guards 
perfectly. Let us repeat the above considerations in a more general setting, 
for a linear program of the form 


maximize $c^T x$ subject to $Ax \le b$ and $x \ge 0$,     (P)

where $A$ is a matrix with $m$ rows and $n$ columns. We are trying to combine
the $m$ inequalities of the system $Ax \le b$ with some nonnegative coefficients
$y_1, y_2, \ldots, y_m$ so that

• the resulting inequality has the $j$th coefficient at least $c_j$, $j = 1,2,\ldots,n$,
  and
• the right-hand side is as small as possible.


This leads to the dual linear program

minimize $b^T y$ subject to $A^T y \ge c$ and $y \ge 0$;     (D)


whoever doesn’t believe this may write it in components. In this context the 
linear program (P) is referred to as the primal linear program. 

From the way we have produced the dual linear program (D), we obtain
the following result:

6.1.1 Proposition. For each feasible solution $y$ of the dual linear program
(D) the value $b^T y$ provides an upper bound on the maximum of the objective
function of the linear program (P). In other words, for each feasible solution
$x$ of (P) and each feasible solution $y$ of (D) we have

$$c^T x \le b^T y.$$


In particular, if (P) is unbounded, (D) has to be infeasible, and if (D) is 
unbounded (from below!), then (P) is infeasible. 
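
For readers who prefer to see the argument written out in symbols, the inequality of the proposition follows from a single chain. This is the standard derivation, spelled out here as an illustration; it uses only the four feasibility conditions $Ax \le b$, $x \ge 0$, $A^T y \ge c$, and $y \ge 0$.

```latex
\[
  c^T x \;\le\; (A^T y)^T x \;=\; y^T (A x) \;\le\; y^T b \;=\; b^T y .
\]
```

The first inequality holds because $A^T y \ge c$ and $x \ge 0$; the second holds because $Ax \le b$ and $y \ge 0$.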


This proposition is usually called the weak duality theorem, weak because 
it expresses only the guarding of the primal linear program (P) by the dual 
linear program (D), but it doesn’t say that the guarding is perfect. The latter 
is expressed only by the duality theorem (sometimes also called the strong 
duality theorem). 


Duality theorem of linear programming 


For the linear programs

maximize $c^T x$ subject to $Ax \le b$ and $x \ge 0$     (P)

and

minimize $b^T y$ subject to $A^T y \ge c$ and $y \ge 0$     (D)


exactly one of the following possibilities occurs: 


1. Neither (P) nor (D) has a feasible solution. 

2. (P) is unbounded and (D) has no feasible solution. 

3. (P) has no feasible solution and (D) is unbounded. 

4. Both (P) and (D) have a feasible solution. Then both have an optimal 
solution, and if x* is an optimal solution of (P) and y* is an optimal 
solution of (D), then 


$$c^T x^* = b^T y^*.$$


That is, the maximum of (P) equals the minimum of (D). 


The duality theorem might look complicated at first encounter. For un- 
derstanding it better it may be useful to consider a simpler version, called 
the Farkas lemma and discussed in Section 6.4. This simpler statement has 
several intuitive interpretations, and it contains the essence of the duality 
theorem. 

Proving the duality theorem, which we will undertake in Sections 6.3 
and 6.4, does take some work, unlike a proof of the weak duality theorem, 
which is quite easy. 


The heart of the duality theorem is the equality $c^T x^* = b^T y^*$ in
the fourth possibility, i.e., for both (P) and (D) feasible.

Since a linear program can be either feasible and bounded, or feasible
and unbounded, or infeasible, there are 3 possibilities for (P)
and 3 possibilities for (D), which at first sight gives 9 possible com- 
binations for (P) and (D). The three cases “(P) unbounded and (D) 
feasible bounded,” “(P) unbounded and (D) unbounded,” and “(P) 
feasible bounded and (D) unbounded” are ruled out by the weak du- 
ality theorem. In the proof of the duality theorem, we will rule out the 
cases “(P) infeasible and (D) feasible bounded,” as well as “(P) fea- 
sible bounded and (D) infeasible.” This leaves us with the four cases 
listed in the duality theorem. All of them can indeed occur. 


Once again: feasibility versus optimality. In Chapter 1 we remarked 
that finding a feasible solution of a linear program is in general computation- 
ally as difficult as finding an optimal solution. There we briefly substantiated 
this claim using binary search. The duality theorem provides a considerably 
more elegant argument: The linear program (P) has an optimal solution if and 
only if the following linear program, obtained by combining the constraints 
of (P), the constraints of (D), and an inequality between the objective func- 
tions, has a feasible solution: 


$$
\begin{array}{ll}
\text{maximize}   & c^T x \\
\text{subject to} & Ax \le b, \\
                  & A^T y \ge c, \\
                  & c^T x \ge b^T y, \\
                  & x \ge 0,\ y \ge 0
\end{array}
$$

(the objective function is immaterial here, and the variables are $x_1,\ldots,x_n$
and $y_1,\ldots,y_m$). Moreover, for each feasible solution $(\tilde{x}, \tilde{y})$ of the last linear
program, $\tilde{x}$ is an optimal solution of the linear program (P). All of this is a
simple consequence of the duality theorem. 
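
The combined system is easy to test for a given candidate pair. The following sketch is an illustration with a hypothetical helper name, `certifies_optimality`: it checks all constraints of the combined system for a pair of vectors, here on the data of the linear program (6.1) from Section 6.1.

```python
from fractions import Fraction as F

def certifies_optimality(c, A, b, x, y):
    """Check that (x, y) is feasible for the combined system
    Ax <= b, A^T y >= c, c^T x >= b^T y, x >= 0, y >= 0."""
    m, n = len(A), len(c)
    primal = all(sum(A[i][j] * x[j] for j in range(n)) <= b[i] for i in range(m))
    dual = all(sum(A[i][j] * y[i] for i in range(m)) >= c[j] for j in range(n))
    no_gap = (sum(c[j] * x[j] for j in range(n))
              >= sum(b[i] * y[i] for i in range(m)))
    signs = all(v >= 0 for v in x) and all(v >= 0 for v in y)
    return primal and dual and no_gap and signs

# Data of the linear program (6.1).
A = [[4, 8], [2, 1], [3, 2]]
b = [12, 3, 4]
c = [2, 3]

x_opt = [F(1, 2), F(5, 4)]
y_opt = [F(5, 16), 0, F(1, 4)]
print(certifies_optimality(c, A, b, x_opt, y_opt))      # optimal pair
print(certifies_optimality(c, A, b, [0, 0], y_opt))     # feasible x, not optimal
print(certifies_optimality(c, A, b, x_opt, [1, 0, 0]))  # feasible y, not optimal
```

Only the first call succeeds: in the other two, both vectors are feasible for their own programs, but the inequality $c^T x \ge b^T y$ between the objective functions fails.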


6.2 Dualization for Everyone 


The duality theorem is valid for each linear program, not only for one of the 
form (P); we have only to construct the dual linear program properly. To 
this end, we can convert the given linear program to the form (P) using the 
tricks from Sections 1.1 and 4.1, and then the dual linear program has the 
form (D). The result can often be simplified; for example, the difference of 
two nonnegative variables can be replaced by a single unbounded variable 
(one that may attain all real values). 

Simpler than doing this again and again is to adhere to the recipe be- 
low (whose validity can be proved by the just mentioned procedure). Let us 
assume that the primal linear program has variables $x_1, x_2, \ldots, x_n$, among
which some may be nonnegative, some nonpositive, and some unbounded.
Let the constraints be $C_1, C_2, \ldots, C_m$, where $C_i$ has the form

$$a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{in}x_n \ \left\{\begin{array}{c} \le \\ \ge \\ = \end{array}\right\} \ b_i.$$

(The nonnegativity or nonpositivity constraints for the variables are not
counted among the $C_i$.) The objective function $c^T x$ should be maximized.

Then the dual linear program has variables $y_1, y_2, \ldots, y_m$, where $y_i$
corresponds to the constraint $C_i$ and satisfies

$$\left.\begin{array}{l} y_i \ge 0 \\ y_i \le 0 \\ y_i \in \mathbb{R} \end{array}\right\} \ \text{if we have} \ \left\{\begin{array}{c} \le \\ \ge \\ = \end{array}\right\} \ \text{in } C_i.$$

The constraints of the dual linear program are $Q_1, Q_2, \ldots, Q_n$, where $Q_j$
corresponds to the variable $x_j$ and reads

$$a_{1j}y_1 + a_{2j}y_2 + \cdots + a_{mj}y_m \ \left\{\begin{array}{c} \ge \\ \le \\ = \end{array}\right\} \ c_j \quad \text{if } x_j \text{ satisfies } \left\{\begin{array}{l} x_j \ge 0 \\ x_j \le 0 \\ x_j \in \mathbb{R} \end{array}\right\}.$$

The objective function is $b^T y$, and it is to be minimized.

Note that in the first part of the recipe (from primal constraints to dual 
variables) the direction of inequalities is reversed, while in the second part 
(from primal variables to dual constraints) the direction is preserved. 


Dualization Recipe

                      Primal linear program          Dual linear program
Variables             $x_1, x_2, \ldots, x_n$        $y_1, y_2, \ldots, y_m$
Matrix                $A$                            $A^T$
Right-hand side       $b$                            $c$
Objective function    max $c^T x$                    min $b^T y$
Constraints           $i$th constraint has $\le$     $y_i \ge 0$
                      $i$th constraint has $\ge$     $y_i \le 0$
                      $i$th constraint has $=$       $y_i \in \mathbb{R}$
                      $x_j \ge 0$                    $j$th constraint has $\ge$
                      $x_j \le 0$                    $j$th constraint has $\le$
                      $x_j \in \mathbb{R}$           $j$th constraint has $=$


If we want to dualize a minimization linear program, we can first transform
it to a maximization linear program by changing the sign of the objective
function, and then follow the recipe. 

In this way one can also find out that the rules work symmetrically “there” 
and “back.” By this we mean that if we start with some linear program, 
construct the dual linear program, and then again the dual linear program, we 
get back to the original (primal) linear program; two consecutive dualizations 
cancel out. In particular, the linear programs (P) and (D) in the duality 
theorem are dual to each other. 
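
The recipe is mechanical enough to be written as a short function. The sketch below is an illustration only: the dictionary layout and the name `dualize` are choices made here, not notation from the text. For a maximization program it applies the recipe as stated. For a minimization program it applies the mirror-image rules obtained by reading the recipe table from right to left, which is an assumption consistent with the remark that the rules work symmetrically in both directions.

```python
def dualize(lp):
    """Dualize a linear program given as a dict with the keys
    'sense' ('max' or 'min'), 'c', 'A', 'b',
    'cons' (one of '<=', '>=', '=' per constraint) and
    'vars' (one of '>=0', '<=0', 'free' per variable)."""
    m, n = len(lp["A"]), len(lp["c"])
    if lp["sense"] == "max":
        # Primal constraint type -> sign of the dual variable (direction reversed).
        to_var = {"<=": ">=0", ">=": "<=0", "=": "free"}
        # Sign of the primal variable -> dual constraint type (direction preserved).
        to_con = {">=0": ">=", "<=0": "<=", "free": "="}
    else:
        # Mirror image: the recipe table read from right to left.
        to_var = {">=": ">=0", "<=": "<=0", "=": "free"}
        to_con = {">=0": "<=", "<=0": ">=", "free": "="}
    return {
        "sense": "min" if lp["sense"] == "max" else "max",
        "c": list(lp["b"]),                       # new objective vector
        "A": [[lp["A"][i][j] for i in range(m)]   # transposed matrix
              for j in range(n)],
        "b": list(lp["c"]),                       # new right-hand side
        "cons": [to_con[v] for v in lp["vars"]],
        "vars": [to_var[r] for r in lp["cons"]],
    }

# A program of the form (P): maximize c^T x subject to Ax <= b, x >= 0.
P = {
    "sense": "max",
    "c": [2, 3],
    "A": [[4, 8], [2, 1], [3, 2]],
    "b": [12, 3, 4],
    "cons": ["<=", "<=", "<="],
    "vars": [">=0", ">=0"],
}
D = dualize(P)
print(D)
print(dualize(D) == P)   # two consecutive dualizations cancel out
```

Applied to a program of the form (P), the function returns a program of the form (D), and dualizing once more gives back the original data.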


A physical interpretation of duality. Let us consider a linear 
program 


maximize $c^T x$ subject to $Ax \le b$.

According to the dualization recipe the dual linear program is

minimize $b^T y$ subject to $A^T y = c$ and $y \ge 0$.


Let us assume that the primal linear program is feasible and bounded, 
and let n = 3. We regard x as a point in three-dimensional space, and 
we interpret c as the gravitation vector; it thus points downward. 

Each of the inequalities of the system $Ax \le b$ determines a half-space.
The intersection of these half-spaces is a nonempty convex polyhedron
bounded from below. Each of its two-dimensional faces is given
by one of the equations $a_i^T x = b_i$, where the vectors $a_1, a_2, \ldots, a_m$ are
the rows of the matrix $A$, but interpreted as column vectors. Let us
denote the face given by $a_i^T x = b_i$ by $S_i$ (not every inequality of the
system $Ax \le b$ has to correspond to a face, and so $S_i$ is not necessarily
defined for every $i$).

Let us imagine that the boundary of the polyhedron is made of 
cardboard and that we drop a tiny steel ball somewhere inside the 
polyhedron. The ball falls and rolls down to the lowest vertex (or 
possibly it stays on a horizontal edge or face). Let us denote the resulting
position of the ball by $x^*$; thus, $x^*$ is an optimal solution of
the linear program. In this stable position the ball touches several
two-dimensional faces, typically 3. Let $D$ be the set of $i$ such that the ball
touches the face $S_i$. For $i \in D$ we thus have

$$a_i^T x^* = b_i. \qquad (6.2)$$

Gravity exerts a force $F$ on the ball that is proportional to the
vector $c$. This force is decomposed into forces of pressure on the faces
touched by the ball. The force $F_i$ by which the ball acts on face $S_i$ is
orthogonal to $S_i$ and it is directed outward from the polyhedron (if we
neglect friction); see the schematic two-dimensional picture below:

The forces acting on the ball are in equilibrium, and thus $F = \sum_{i \in D} F_i$.
The outward normal of the face $S_i$ is $a_i$; hence $F_i$ is proportional
to $a_i$, and for some nonnegative numbers $y_i^*$ we have

$$\sum_{i \in D} y_i^* a_i = c.$$

If we set $y_i^* = 0$ for $i \notin D$, we can write $\sum_{i=1}^{m} y_i^* a_i = c$, or $A^T y^* = c$
in matrix form. Therefore, $y^*$ is a feasible solution of the dual linear
program.

Let us consider the product $(y^*)^T (Ax^* - b)$. For $i \notin D$ the $i$th
component of $y^*$ equals 0, while for $i \in D$ the $i$th component of
$Ax^* - b$ is 0 according to (6.2). So the product is 0, and hence
$(y^*)^T b = (y^*)^T A x^* = c^T x^*$.

We see that $x^*$ is a feasible solution of the primal linear program,
$y^*$ is a feasible solution of the dual linear program, and $c^T x^* = b^T y^*$.
By the weak duality theorem y* is an optimal solution of the dual 
linear program, and we have a situation exactly as in the duality the- 
orem. We have just “physically verified” a special three-dimensional 
case of the duality theorem. 


We remark that the dual linear program also has an economic inter- 
pretation. The dual variables are called shadow prices in this context. 
The interested reader will find this nicely explained in Chvátal’s textbook
cited in Chapter 9.


6.3 Proof of Duality from the Simplex Method 


The duality theorem of linear programming can be quickly derived from the 
correctness of the simplex method. To be precise, we will prove the following: 


If the primal linear program (P) is feasible and bounded, then the dual 
linear program (D) is feasible (and bounded as well, by weak duality), 
with the same optimum value as the primal.

Since the dual of the dual is the primal, we may interchange (P) and (D)
in this statement. Together with our considerations about the possible cases 
after the statement of the duality theorem, this will prove the theorem. 

The key observation is that we can extract an optimal solution of the dual 
linear program from the final tableau. We should recall, though, that proving 
the correctness of the simplex method, and in particular, the fact that one 
can always avoid cycling, requires considerable work. 

Let us consider a primal linear program 


maximize $c^T x$ subject to $Ax \le b$ and $x \ge 0$.     (P)


After a conversion to equational form via slack variables $x_{n+1},\ldots,x_{n+m}$ we arrive
at the linear program

maximize $\bar{c}^T \bar{x}$ subject to $\bar{A}\bar{x} = b$ and $\bar{x} \ge 0$,

where $\bar{x} = (x_1,\ldots,x_{n+m})$, $\bar{c} = (c_1,\ldots,c_n,0,\ldots,0)$, and $\bar{A} = (A \mid I_m)$. If this
last linear program is feasible and bounded, then according to Theorem 5.8.1,
the simplex method with Bland’s rule always finds some optimal solution $\bar{x}^*$
with a feasible basis $B$. The first $n$ components of the vector $\bar{x}^*$ constitute
an optimal solution $x^*$ of the linear program (P). By the optimality criterion
we have $r \le 0$ in the final simplex tableau, where $r$ is the vector in the $z$-row
of the tableau as in Section 5.5. The following lemma and the weak duality 
theorem (Proposition 6.1.1) then easily imply the duality theorem. 


6.3.1 Lemma. In the described situation the vector $y^* = (\bar{c}_B^T \bar{A}_B^{-1})^T$ is a
feasible solution of the dual linear program (D) and the equality $c^T x^* = b^T y^*$
holds.


Proof. By Lemma 5.5.1, $\bar{x}^*$ is given by $\bar{x}_B^* = \bar{A}_B^{-1} b$ and $\bar{x}_N^* = 0$, and so

$$c^T x^* = \bar{c}^T \bar{x}^* = \bar{c}_B^T \bar{x}_B^* = \bar{c}_B^T (\bar{A}_B^{-1} b) = (\bar{c}_B^T \bar{A}_B^{-1}) b = (y^*)^T b = b^T y^*.$$

The equality $c^T x^* = b^T y^*$ thus holds, and it remains to check the feasibility
of $y^*$, that is, $A^T y^* \ge c$ and $y^* \ge 0$.

The condition $y^* \ge 0$ can be rewritten to $I_m y^* \ge 0$, and hence both of
the feasibility conditions together are equivalent to

$$\bar{A}^T y^* \ge \bar{c}. \qquad (6.3)$$

After substituting $y^* = (\bar{c}_B^T \bar{A}_B^{-1})^T$ the left-hand side becomes
$\bar{A}^T (\bar{c}_B^T \bar{A}_B^{-1})^T = (\bar{c}_B^T \bar{A}_B^{-1} \bar{A})^T$.
Let us denote this $(n+m)$-component vector by $w$. For the basic
components of $w$ we have

$$w_B = (\bar{c}_B^T \bar{A}_B^{-1} \bar{A}_B)^T = (\bar{c}_B^T I_m)^T = \bar{c}_B,$$

and thus we even have equality in (6.3) for the basic components. For the
nonbasic components we have

$$w_N = (\bar{c}_B^T \bar{A}_B^{-1} \bar{A}_N)^T = \bar{c}_N - r \ge \bar{c}_N$$

since $r = \bar{c}_N - (\bar{c}_B^T \bar{A}_B^{-1} \bar{A}_N)^T$ by Lemma 5.5.1, and $r \le 0$ by the optimality
criterion. The lemma is proved.
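
The formula of Lemma 6.3.1 can be tried out on the linear program (6.1) from Section 6.1. The sketch below is an illustration under stated assumptions: it takes as given that the basis consisting of $x_1$, $x_2$, and the slack variable $x_4$ is an optimal feasible basis for (6.1), and it uses a small exact Gauss-Jordan solver, `solve`, written here for the purpose and valid only for nonsingular square systems. Instead of inverting $\bar{A}_B$, it obtains $y^*$ as the solution of $\bar{A}_B^T y = \bar{c}_B$.

```python
from fractions import Fraction

def solve(M, rhs):
    """Solve the square system M v = rhs exactly (M assumed nonsingular)."""
    k = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(k):
        piv = next(r for r in range(col, k) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(k):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[k] for row in aug]

# Data of the linear program (6.1).
A = [[4, 8], [2, 1], [3, 2]]
b = [12, 3, 4]
c = [2, 3]
m, n = len(A), len(c)

A_bar = [A[i] + [int(i == k) for k in range(m)] for i in range(m)]  # (A | I_m)
c_bar = c + [0] * m
B = [0, 1, 3]   # assumed optimal basis x_1, x_2, x_4 (0-based column indices)

# y* = (c_B^T A_B^{-1})^T is the solution of A_B^T y = c_B.
A_B_T = [[A_bar[i][j] for i in range(m)] for j in B]
y_star = solve(A_B_T, [c_bar[j] for j in B])

# The basic part of the primal solution is the solution of A_B x_B = b.
A_B = [[A_bar[i][j] for j in B] for i in range(m)]
x_B = solve(A_B, b)

print("y* =", y_star)
print("x_B* =", x_B)
print("b^T y* =", sum(bi * yi for bi, yi in zip(b, y_star)))
print("c_B^T x_B* =", sum(c_bar[j] * v for j, v in zip(B, x_B)))
```

Both objective values printed at the end agree, and $y^*$ satisfies $A^T y^* \ge c$ and $y^* \ge 0$, as the lemma asserts.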


6.4 Proof of Duality from the Farkas Lemma 


Another approach to the duality theorem of linear programming consists in 
first proving a simplified version, called the Farkas lemma, and then substi- 
tuting a skillfully composed matrix into it and thus deriving the theorem. 
A nice feature is that the Farkas lemma has very intuitive interpretations. 

Actually, the Farkas lemma comes in several natural variants. We begin 
by discussing one of them, which has a very clear geometric meaning. 


6.4.1 Proposition (Farkas lemma). Let $A$ be a real matrix with $m$ rows
and $n$ columns, and let $b \in \mathbb{R}^m$ be a vector. Then exactly one of the following
two possibilities occurs:


(F1) There exists a vector $x \in \mathbb{R}^n$ satisfying $Ax = b$ and $x \ge 0$.
(F2) There exists a vector $y \in \mathbb{R}^m$ such that $y^T A \ge 0^T$ and $y^T b < 0$.


It is easily seen that both possibilities cannot occur at the same time.
Indeed, the vector $y$ in (F2) determines a linear combination of the equations
witnessing that $Ax = b$ cannot have any nonnegative solution: All
coefficients on the left-hand side of the resulting equation $(y^T A)x = y^T b$ are
nonnegative, but the right-hand side is negative.
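
Both alternatives come with a certificate that is easy to verify. The following sketch is an illustration with hypothetical function names and a small example chosen here: it checks a proposed $x$ for (F1) and a proposed $y$ for (F2).

```python
def satisfies_F1(A, b, x):
    """Check (F1): Ax = b and x >= 0."""
    m, n = len(A), len(A[0])
    return (all(v >= 0 for v in x)
            and all(sum(A[i][j] * x[j] for j in range(n)) == b[i]
                    for i in range(m)))

def satisfies_F2(A, b, y):
    """Check (F2): y^T A >= 0^T and y^T b < 0."""
    m, n = len(A), len(A[0])
    yA = [sum(y[i] * A[i][j] for i in range(m)) for j in range(n)]
    yb = sum(y[i] * b[i] for i in range(m))
    return all(v >= 0 for v in yA) and yb < 0

A = [[1, 1],
     [1, -1]]

# x1 + x2 = 3, x1 - x2 = 1 has the nonnegative solution x = (2, 1).
print(satisfies_F1(A, [3, 1], [2, 1]))

# x1 + x2 = 1, x1 - x2 = 3 forces x2 = -1, so no nonnegative solution exists;
# y = (1, -1) gives y^T A = (0, 2) and y^T b = -2, a certificate for (F2).
print(satisfies_F2(A, [1, 3], [1, -1]))

# The same y is not a certificate for the first right-hand side.
print(satisfies_F2(A, [3, 1], [1, -1]))
```

The checks only verify given certificates. Finding $x$ or $y$ in the first place is the harder task, which the proofs in the following sections address.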

The Farkas lemma is not exactly a difficult theorem, but it is not trivial 
either. Many proofs are known, and we will present some of them in the 
subsequent sections. The reader is invited to choose the “best” one according 
to personal taste. 


A geometric view. In order to view the Farkas lemma geometrically, we 
need the notion of convex hull; see Section 4.3. Further we define, for vectors
$a_1, a_2, \ldots, a_n \in \mathbb{R}^m$, the convex cone generated by $a_1, a_2, \ldots, a_n$ as the set
of all linear combinations of the $a_i$ with nonnegative coefficients, that is, as

$$\{t_1 a_1 + t_2 a_2 + \cdots + t_n a_n : t_1, t_2, \ldots, t_n \ge 0\}.$$

In other words, this convex cone is the convex hull of the rays $p_1, p_2, \ldots, p_n$,
where $p_i = \{t a_i : t \ge 0\}$ emanates from the origin and passes through the
point $a_i$.


6.4.2 Proposition (Farkas lemma geometrically). Let $a_1, a_2, \ldots, a_n, b$
be vectors in $\mathbb{R}^m$. Then exactly one of the following two possibilities occurs:


(F1′) The point $b$ lies in the convex cone $C$ generated by $a_1, a_2, \ldots, a_n$.
(F2′) There exists a hyperplane $h$ passing through the point 0, of the form

      $$h = \{x \in \mathbb{R}^m : y^T x = 0\}$$

      for a suitable $y \in \mathbb{R}^m$, such that all the vectors $a_1, a_2, \ldots, a_n$ (and thus
      the whole cone $C$) lie on one side and $b$ lies (strictly) on the other side.
      That is, $y^T a_i \ge 0$ for all $i = 1, 2, \ldots, n$ and $y^T b < 0$.


A drawing illustrates both possibilities for $m = 2$ and $n = 3$:

[Figure: two panels, labeled (F1′) and (F2′).]


To see that Proposition 6.4.1 and Proposition 6.4.2 really tell us the same 
thing, it suffices to take the columns of the matrix $A$ for $a_1, a_2, \ldots, a_n$. The
existence of a nonnegative solution of $Ax = b$ can be reexpressed as
$b = t_1 a_1 + t_2 a_2 + \cdots + t_n a_n$, $t_1, t_2, \ldots, t_n \ge 0$, and this says exactly that $b \in C$. The
equivalence of (F2) and (F2′) hopefully doesn’t need any further explanation.

This result is an instance of a separation theorem for convex sets. Separation
theorems generally assert that disjoint convex sets can be separated
by a hyperplane. There are several versions (depending on whether one requires
strict or nonstrict separation, etc.) and several proof strategies. Separation
theorems in infinite-dimensional Banach spaces are closely related to
the Hahn–Banach theorem, one of the cornerstones of functional analysis.
In Section 6.5 we prove the Farkas lemma along these lines, viewing it as a
geometric separation theorem.

Variants of the Farkas lemma. Proposition 6.4.1 provides an answer to
the question, “When does a system of linear equalities have a nonnegative 
solution?” In part (i) of the following proposition, we restate Proposition 6.4.1 
(in a slightly different, but clearly equivalent form), and in parts (ii) and (iii), 
we add two more variants of the Farkas lemma. Part (ii) answers the question, 
“When does a system of linear inequalities have a nonnegative solution?” and 
part (iii) the question, “When does a system of linear inequalities have any 
solution at all?” 


6.4.3 Proposition (Farkas lemma in three variants). Let $A$ be a real
matrix with $m$ rows and $n$ columns, and let $b \in \mathbb{R}^m$ be a vector.

(i) The system $Ax = b$ has a nonnegative solution if and only if every
    $y \in \mathbb{R}^m$ with $y^T A \ge 0^T$ also satisfies $y^T b \ge 0$.
(ii) The system $Ax \le b$ has a nonnegative solution if and only if every
    nonnegative $y \in \mathbb{R}^m$ with $y^T A \ge 0^T$ also satisfies $y^T b \ge 0$.
(iii) The system $Ax \le b$ has a solution if and only if every nonnegative
    $y \in \mathbb{R}^m$ with $y^T A = 0^T$ also satisfies $y^T b \ge 0$.


The three parts of Proposition 6.4.3 are mutually equivalent, in the sense 
that any of them can easily be derived from any other. Having three forms 
at our disposal provides more flexibility, both for applying the Farkas lemma 
and for proving it. 


The proof of the equivalence (i)$\Leftrightarrow$(ii)$\Leftrightarrow$(iii) is easy, using the tricks
familiar from transformations of linear programs to equational form.
We will take a utilitarian approach: Since we will use (ii) in the proof
of the duality theorem, we prove only the implications (i)$\Rightarrow$(ii) and
(iii)$\Rightarrow$(ii), leaving the remaining implications to the reader.

Proof of (i)$\Rightarrow$(ii). In (ii) we need an equivalent condition for $Ax \le b$
having a nonnegative solution. To this end, we form the matrix
$\bar{A} = (A \mid I_m)$. We note that $Ax \le b$ has a nonnegative solution if and only
if $\bar{A}\bar{x} = b$ has a nonnegative solution. By (i), this is equivalent to
the condition that all $y$ with $y^T \bar{A} \ge 0^T$ satisfy $y^T b \ge 0$. And finally,
$y^T \bar{A} \ge 0^T$ says exactly the same as $y^T A \ge 0^T$ and $y \ge 0$, and hence
we have the desired equivalence.


Proof of (iii)$\Rightarrow$(ii). Again we need an equivalent condition for $Ax \le b$
having a nonnegative solution. This time we form the matrix $\bar{A}$ and
the vector $\bar{b}$ according to

$$\bar{A} = \begin{pmatrix} A \\ -I_n \end{pmatrix}, \qquad \bar{b} = \begin{pmatrix} b \\ 0 \end{pmatrix}.$$

Then $Ax \le b$ has a nonnegative solution if and only if $\bar{A}x \le \bar{b}$ has
any solution. The latter is equivalent, by (iii), to the condition that
all $\bar{y} \ge 0$ with $\bar{y}^T \bar{A} = 0^T$ satisfy $\bar{y}^T \bar{b} \ge 0$. Writing

$$\bar{y} = \begin{pmatrix} y \\ y' \end{pmatrix},$$

$y$ a vector with $m$ components, we have

$$\bar{y} \ge 0,\ \bar{y}^T \bar{A} = 0^T \quad \text{exactly if} \quad y \ge 0,\ y'^T = y^T A \ge 0^T$$

and

$$\bar{y}^T \bar{b} = y^T b.$$
From this and our chain of equivalences, we deduce that $Ax \le b$ has
a nonnegative solution if and only if all $y \ge 0$ with $y^T A \ge 0^T$ satisfy
$y^T b \ge 0$, and this is the statement of (ii).

Remarks. A reader with a systematic mind may like to see the
variants of the Farkas lemma summarized in a table:


                          The system $Ax \le b$             The system $Ax = b$

has a solution            $y \ge 0$, $y^T A \ge 0^T$        $y^T A \ge 0^T$
$x \ge 0$ iff             $\Rightarrow y^T b \ge 0$         $\Rightarrow y^T b \ge 0$

has a solution            $y \ge 0$, $y^T A = 0^T$          $y^T A = 0^T$
$x \in \mathbb{R}^n$ iff  $\Rightarrow y^T b \ge 0$         $\Rightarrow y^T b = 0$


We had three variants of the Farkas lemma, but the table has four 
entries. We haven’t mentioned the statement corresponding to the 
bottom right corner of the table, telling us when a system of linear 
equations has any solution. We haven’t mentioned it because it doesn’t 
deserve to be called a Farkas lemma—the proof is a simple exercise 
in linear algebra, and there doesn’t seem to be any way of deriving 
the Farkas lemma from this variant along the lines of our previous 
reductions. However, we will find this statement useful in Section 6.6, 
where it will serve as a basis of a proof of a “real” Farkas lemma. 

Let us note that, similar to “dualization for everyone,” we could 
also establish a unifying “Farkas lemma for everyone,” dealing with 
a system containing both linear equations and inequalities and with 
some of the variables nonnegative and some unrestricted. This would 
contain all of the four variants considered above as special cases, but 
we will not go in this direction. 


A logical view. Now we explain yet another way of understanding the 
Farkas lemma, this time variant (iii) in Proposition 6.4.3. We begin with 
something seemingly different, namely, deriving new linear inequalities from 
old ones. From two given inequalities, say 


$$4x_1 + x_2 \le 4 \quad \text{and} \quad -x_1 + x_2 \le 1,$$

we can derive new inequalities by multiplying the first inequality by a positive
real number $\alpha$, the second one by a positive real number $\beta$, and adding the
resulting inequalities together (we must be careful so that both inequality
signs have the same direction!); we have already used this many times. For
instance, for $\alpha = 3$ and $\beta = 2$ we derive the inequality $10x_1 + 5x_2 \le 14$.
More generally, if we start with a system of several linear inequalities, of the
form $Ax \le b$, we can derive new inequalities by repeating this operation for
various pairs, which may involve both the original inequalities and new ones
derived earlier. So if we start with the system

$$4x_1 + x_2 \le 4, \quad -x_1 + x_2 \le 1, \quad \text{and} \quad -2x_1 - x_2 \le -3,$$

we can first derive $10x_1 + 5x_2 \le 14$ from the first two as before, and then
we can add to this new inequality the third inequality multiplied by 5. In
this case both of the coefficients on the left-hand side cancel out, and we get
the inequality $0 \le -1$. This last inequality obviously never holds, and so the
original triple of inequalities cannot be satisfied by any $(x_1, x_2) \in \mathbb{R}^2$ either
(as is easy to check using a picture).
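
The derivation can be replayed step by step. In the sketch below, added as an illustration, an inequality $a^T x \le \beta$ is stored as a pair `(coefficients, right_hand_side)`, and the helper `combine` (a name chosen here) forms a positive combination of two such inequalities. The three starting inequalities are $4x_1 + x_2 \le 4$, $-x_1 + x_2 \le 1$, and $-2x_1 - x_2 \le -3$.

```python
def combine(alpha, ineq1, beta, ineq2):
    """Return alpha*ineq1 + beta*ineq2 for inequalities stored as
    (coeffs, rhs), meaning coeffs . x <= rhs; alpha, beta must be positive."""
    assert alpha > 0 and beta > 0, "multipliers must be positive"
    (a1, r1), (a2, r2) = ineq1, ineq2
    coeffs = [alpha * u + beta * v for u, v in zip(a1, a2)]
    return coeffs, alpha * r1 + beta * r2

first = ([4, 1], 4)      #  4*x1 + x2 <= 4
second = ([-1, 1], 1)    #   -x1 + x2 <= 1
third = ([-2, -1], -3)   # -2*x1 - x2 <= -3

step1 = combine(3, first, 2, second)
step2 = combine(1, step1, 5, third)
print(step1)
print(step2)
```

The second step leaves all coefficients equal to zero with a negative right-hand side, which is the contradiction $0 \le -1$. Altogether the three inequalities were used with the multipliers $y = (3, 2, 5)$.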

The Farkas lemma turns out to be equivalent to the following statement: 


Whenever a system $Ax \le b$ of finitely many linear inequalities is
inconsistent, that is, there is no $x \in \mathbb{R}^n$ satisfying it, we can derive the
(obviously inconsistent) inequality $0 \le -1$ from it by the above procedure.


A little thought reveals that each inequality derived by the procedure
(repeated combinations of pairs) has the form $(y^T A)x \le y^T b$ for some
nonnegative vector $y \in \mathbb{R}^m$, and thus, equivalently, we claim that whenever
$Ax \le b$ is inconsistent, there exists a vector $y \ge 0$ with $y^T A = 0^T$ and
$y^T b = -1$. This is clearly equivalent to part (iii) of Proposition 6.4.3.


The reader may wonder why we have bothered to consider repeated 
pairwise combinations of inequalities, instead of using a single vector 
y specifying a combination of all of the inequalities right away. The 
reason is that the “pairwise” formulation makes the statement more 
similar to a number of important and famous statements in various 
branches of mathematics. In logic, for example, theorems are derived 
(proved) from axioms by repeated application of certain simple deriva- 
tion rules. In the first-order propositional calculus, there is a complete- 
ness theorem: Any true statement (that is, a statement valid in every 
model) can be derived from the axioms by a finite sequence of steps, 
using the appropriate derivation rules, such as modus ponens. In contrast,
the celebrated Gödel’s first incompleteness theorem asserts that
in Peano arithmetic, as well as in any theory containing it, there are 
statements that are true but cannot be derived. 

In analogy to this, we can view the inequalities of the original system
$Ax \le b$ as “axioms,” and we have a single derivation rule (derive a
new inequality from two existing ones by a positive linear combination
as above). Then the Farkas lemma tells us that any inconsistent system
of “axioms” can be refuted by a suitable derivation. (This is a “weak”
completeness theorem; we could also consider a more general “completeness
theorem,” stating that whenever a linear inequality is valid
for all $x \in \mathbb{R}^n$ satisfying $Ax \le b$, then it can be derived from $Ax \le b$,
but we will not go into this here.) Such a completeness result means
that the theory of linear inequalities is, in a sense, “easy.” Moreover,
the simplex method, or also the Fourier–Motzkin elimination considered
in Section 6.7, provide ways to construct such a derivation.

This view makes the Farkas lemma a (small) cousin of various
completeness theorems of logic and of other famous results, such as
Hilbert’s Nullstellensatz in algebraic geometry. Computer science also
frequently investigates the possibility of deriving some object from 
given initial objects by certain derivation rules, say in the theory of 
formal languages. 


Proof of the duality theorem from the Farkas lemma. Let us assume 
that the linear program (P) has an optimal solution x*. As in the proof of 
the duality theorem from the simplex method, we show that the dual (D) has 
an optimal solution as well, and that the optimum values of both programs 
coincide. 

We first define $\gamma = c^T x^*$ to be the optimum value of (P). Then we know 
that the system of inequalities 

\[ Ax \le b, \quad c^T x \ge \gamma \tag{6.4} \]

has a nonnegative solution, but for any $\varepsilon > 0$, the system 

\[ Ax \le b, \quad c^T x \ge \gamma + \varepsilon \tag{6.5} \]

has no nonnegative solution. If we define an $(m+1)\times n$ matrix $\hat A$ and a 
vector $\hat b_\varepsilon \in \mathbb{R}^{m+1}$ by 

\[ \hat A = \begin{pmatrix} A \\ -c^T \end{pmatrix}, \qquad
   \hat b_\varepsilon = \begin{pmatrix} b \\ -\gamma - \varepsilon \end{pmatrix}, \]

then (6.4) is equivalent to $\hat A x \le \hat b_0$ and (6.5) is equivalent to 
$\hat A x \le \hat b_\varepsilon$. 

Let us apply the variant of the Farkas lemma in Proposition 6.4.3(ii). For 
$\varepsilon > 0$, the system $\hat A x \le \hat b_\varepsilon$ has no nonnegative solution, so we 
conclude that there is a nonnegative vector $\hat y = (u, z) \in \mathbb{R}^{m+1}$ such that 
$\hat y^T \hat A \ge 0^T$ but $\hat y^T \hat b_\varepsilon < 0$. These conditions boil down to 

\[ A^T u \ge z c, \qquad b^T u < z(\gamma + \varepsilon). \tag{6.6} \]

Applying the Farkas lemma in the case $\varepsilon = 0$ (the system has a nonnegative 
solution), we see that the very same vector $\hat y$ must satisfy $\hat y^T \hat b_0 \ge 0$, and 
this is equivalent to 

\[ b^T u \ge z\gamma. \]

It follows that $z > 0$, since $z = 0$ would contradict the strict inequality in 
(6.6). But then we may set $v := \frac{1}{z} u \ge 0$, and (6.6) yields 

\[ A^T v \ge c, \qquad b^T v < \gamma + \varepsilon. \]

In other words, $v$ is a feasible solution of (D), with the value of the objective 
function smaller than $\gamma + \varepsilon$. 

By the weak duality theorem, every feasible solution of (D) has value of 
the objective function at least $\gamma$. Hence (D) is a feasible and bounded linear 
program, and so we know that it has an optimal solution $y^*$ (Theorem 4.2.3). 
Its value $b^T y^*$ is between $\gamma$ and $\gamma + \varepsilon$ for every $\varepsilon > 0$, and thus it equals $\gamma$. 
This concludes the proof of the duality theorem. 

6.5 Farkas Lemma: An Analytic Proof 


In this section we prove the geometric version of the Farkas lemma, 
Proposition 6.4.2, by means of elementary geometry and analysis. We are given 
vectors $a_1,\dots,a_n$ in $\mathbb{R}^m$, and we let $C$ be the convex cone generated by 
them, i.e., the set of all linear combinations with nonnegative coefficients. 
Proving the Farkas lemma amounts to showing that for any vector $b \notin C$ 
there exists a hyperplane separating it from $C$ and passing through $0$. In 
other words, we want to exhibit a vector $y \in \mathbb{R}^m$ with $y^T b < 0$ and 
$y^T x \ge 0$ for all $x \in C$. 

The plan of the proof is straightforward: We let $z$ be the point of $C$ nearest 
to $b$ (in the Euclidean distance), and we check that the vector $y = z - b$ is 
as required; see the following illustration: 


[Figure: the cone C, a point b outside C, the point z of C nearest to b, and the vector y = z - b.] 


The main technical part of the proof is to show that the nearest point z 
exists. Indeed, in principle, it might happen that no point is the nearest (for 
example, such a situation occurs for the point 0 on the real line and the open 
interval (1,2); the interval contains points with distance to 0 as close to 1 as 
desired, but no point at distance exactly 1). 


6.5.1 Lemma. Let $C$ be a convex cone in $\mathbb{R}^m$ generated by finitely many 
vectors $a_1,\dots,a_n$, and let $b \notin C$ be a point. Then there exists a point 
$z \in C$ nearest to $b$ (it is also unique but we won’t need this). 


Proof of Proposition 6.4.2 assuming Lemma 6.5.1. As announced, we 
set $y = z - b$, where $z$ is a point of $C$ nearest to $b$. 

First we check that $y^T z = 0$. This is clear for $z = 0$. For $z \ne 0$, 
if $z$ were not perpendicular to $y$, we could move $z$ slightly along the ray 
$\{tz : t \ge 0\} \subseteq C$ and get a point closer to $b$. More formally, let us first 
assume that $y^T z > 0$, and let us set $z' = (1 - \alpha)z$ for a small $\alpha > 0$. 
We calculate 

\[ \|z' - b\|^2 = (y - \alpha z)^T (y - \alpha z) = \|y\|^2 - 2\alpha\, y^T z + \alpha^2 \|z\|^2. \]

We have $2\alpha\, y^T z > \alpha^2 \|z\|^2$ for all sufficiently small $\alpha > 0$, and thus 
$\|z' - b\|^2 < \|y\|^2 = \|z - b\|^2$. This contradicts $z$ being a nearest point. 
The case $y^T z < 0$ is handled similarly. 

To verify $y^T b < 0$, we recall that $y \ne 0$, and we compute 
$0 < y^T y = y^T z - y^T b = -y^T b$. 

Next, let $x \in C$, $x \ne z$. The angle $\angle bzx$ has to be at least 90 degrees, 
for otherwise, points on the segment $zx$ sufficiently close to $z$ would lie closer 
to $b$ than $z$; equivalently, $(b - z)^T (x - z) \le 0$ (this is similar to the above 
argument for $y^T z = 0$ and we leave a formal verification to the reader). Thus 
$0 \ge (b - z)^T (x - z) = -y^T x + y^T z = -y^T x$. The Farkas lemma is proved. 
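
As a small numerical sanity check of this construction, here is a sketch in 
Python. It assumes, purely for illustration, that C is the nonnegative 
orthant in the plane (generated by the two standard basis vectors), because 
then the nearest point is obtained by replacing negative coordinates by 
zero. The function names are ours and not part of the text. 

```python
def nearest_point_in_orthant(b):
    """Point of the nonnegative orthant nearest to b (negative entries become 0)."""
    return [max(bi, 0.0) for bi in b]


def dot(u, v):
    """Standard scalar product of two vectors given as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def separating_vector(b):
    """Return y = z - b, where z is the point of the orthant nearest to b."""
    z = nearest_point_in_orthant(b)
    return [zi - bi for zi, bi in zip(z, b)]


generators = [[1.0, 0.0], [0.0, 1.0]]  # the cone C is the nonnegative orthant
b = [-1.0, 2.0]                        # b lies outside C
y = separating_vector(b)               # here z = (0, 2), so y = (1, 0)

print("y =", y)
print("y^T b =", dot(y, b))
print("y^T a_i =", [dot(y, a) for a in generators])
```

The vector y has a negative scalar product with b and nonnegative scalar 
products with both generators, hence with every point of C. 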


It remains to prove Lemma 6.5.1. We do it in several steps, and each of 
them is an interesting little fact in itself. 


6.5.2 Lemma. Let $X \subseteq \mathbb{R}^m$ be a nonempty closed set and let $b \in \mathbb{R}^m$ be a 
point. Then $X$ has (at least one) point nearest to $b$. 


Proof. This is simple but it needs basic facts about compact sets in $\mathbb{R}^d$. 
Let us fix an arbitrary $x_0 \in X$, let $r = \|x_0 - b\|$, and let 
$K = \{x \in X : \|x - b\| \le r\}$. Clearly, if $K$ has a nearest point to $b$, then the 
same point is a point of $X$ nearest to $b$. Since $K$ is the intersection of $X$ with 
a closed ball of radius $r$, it is closed and bounded, and hence compact. We 
define the function $f\colon K \to \mathbb{R}$ by $f(x) = \|x - b\|$. Then $f$ is a continuous 
function on a compact set, and any such function attains a minimum; that 
is, there exists $z \in K$ with $f(z) \le f(x)$ for all $x \in K$. Such a $z$ is a point 
of $K$ nearest to $b$. 


So it remains to prove the following statement: 
6.5.3 Lemma. Every finitely generated convex cone is closed. 


This lemma is not as obvious as it might seem. As a warning example, 
let us consider a closed disk $D$ in the plane with $0$ on the boundary. Then 
the cone generated by $D$, that is, the set $\{tx : x \in D,\ t \ge 0\}$, is an open 
half-plane plus the point $0$, and thus it is not closed. Of course, this doesn’t 
contradict the lemma, but it shows that we must use the finiteness somehow. 

Let us define a primitive cone in $\mathbb{R}^m$ as a convex cone generated by some 
$k \le m$ linearly independent vectors. Before proving Lemma 6.5.3, we deal 
with the following special case: 


6.5.4 Lemma. Every primitive cone $P$ in $\mathbb{R}^m$ is closed. 


Proof. Let $P_0 \subseteq \mathbb{R}^k$ be the cone generated by the vectors $e_1,\dots,e_k$ of the 
standard basis of $\mathbb{R}^k$. In other words, $P_0$ is the nonnegative orthant, and its 
closedness is hopefully beyond any doubt (for example, it is the intersection 
of the closed half-spaces $x_i \ge 0$, $i = 1,2,\dots,k$). 

Let the given primitive cone $P \subseteq \mathbb{R}^m$ be generated by linearly independent 
vectors $a_1,\dots,a_k$. We define a linear mapping $f\colon \mathbb{R}^k \to \mathbb{R}^m$ by 
$f(x) = x_1 a_1 + x_2 a_2 + \cdots + x_k a_k$. This $f$ is injective by the linear 
independence of the $a_j$, and we have $P = f(P_0)$. So it suffices to prove the 
following claim: The image $P = f(P_0)$ of a closed set $P_0$ under an injective 
linear map $f\colon \mathbb{R}^k \to \mathbb{R}^m$ is closed. 

To see this, we let $L = f(\mathbb{R}^k)$ be the image of $f$. Since $f$ is injective, it 
is a linear isomorphism of $\mathbb{R}^k$ and $L$. A linear isomorphism $f$ has a linear 
inverse map $g = f^{-1}\colon L \to \mathbb{R}^k$. Every linear map between Euclidean spaces 
is continuous (this can be checked using a matrix form of the map), and we 
have $P = g^{-1}(P_0)$. The preimage of a closed set under a continuous map is 
closed by definition (while the image of a closed set under a continuous map 
need not be closed in general!), so $P$ is closed as a subset of $L$. Since $L$ is 
closed in $\mathbb{R}^m$ (being a linear subspace), we get that $P$ is closed as desired. 


Lemma 6.5.3 is now a consequence of Lemma 6.5.4, of the fact that the 
union of finitely many closed sets is closed, and of the next lemma: 


6.5.5 Lemma. Let $C$ be a convex cone in $\mathbb{R}^m$ generated by finitely many 
vectors $a_1,\dots,a_n$. Then $C$ can be expressed as a union of finitely many 
primitive cones. 


Proof. For every $x \in C$ we are going to verify that it is contained in a 
primitive cone generated by a suitable set of linearly independent vectors among 
the $a_i$. We may assume $x \ne 0$ (since $\{0\}$ is the primitive cone generated by 
the empty set of vectors). 

Let $I \subseteq \{1,2,\dots,n\}$ be a set of minimum possible size such that $x$ lies 
in the convex cone generated by $A_I = \{a_i : i \in I\}$ (this is a standard trick 
in linear algebra and in convex geometry). That is, there exist nonnegative 
coefficients $\alpha_i$, $i \in I$, with $x = \sum_{i \in I} \alpha_i a_i$. The $\alpha_i$ are even strictly 
positive since if some $\alpha_i = 0$, we could delete $i$ from $I$. We now want to show 
that the set $A_I$ is linearly independent. For contradiction, we suppose that 
there is a nontrivial linear combination $\sum_{i \in I} \beta_i a_i = 0$, where not all $\beta_i$ are 
$0$. Then there exists a real $t$ such that all the expressions $\alpha_i - t\beta_i$ are 
nonnegative and at least one of them is zero. (To see this, we can first consider 
the case that some $\beta_i$ is strictly positive, we start with $t = 0$, we let it grow, 
and see what happens. The case of a strictly negative $\beta_i$ is analogous with $t$ 
decreasing from the initial value $0$.) Then the equation 

\[ x = \sum_{i \in I} (\alpha_i - t\beta_i)\, a_i \]

expresses $x$ as a linear combination with positive coefficients of fewer than 
$|I|$ vectors, which contradicts the minimality of $I$. 
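
The argument in this proof can be turned into a small procedure. The 
following Python sketch is an illustration under our own assumptions: 
instead of choosing a minimum set I at the outset, it repeatedly finds a 
linear dependence among the currently used generators and removes at least 
one of them by the step with the parameter t. The function names are 
hypothetical, and exact rational arithmetic is used to avoid rounding. 

```python
from fractions import Fraction


def find_dependence(vectors):
    """Return beta, not all zero, with sum(beta[i] * vectors[i]) = 0,
    or None if the vectors are linearly independent."""
    k = len(vectors)
    if k == 0:
        return None
    m = len(vectors[0])
    # Column j of this matrix is vectors[j]; we bring it to reduced row echelon form.
    rows = [[Fraction(vectors[j][i]) for j in range(k)] for i in range(m)]
    pivots = []
    for col in range(k):
        r = len(pivots)
        p = next((i for i in range(r, m) if rows[i][col] != 0), None)
        if p is None:
            continue
        rows[r], rows[p] = rows[p], rows[r]
        lead = rows[r][col]
        rows[r] = [v / lead for v in rows[r]]
        for i in range(m):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col]
                rows[i] = [a - factor * c for a, c in zip(rows[i], rows[r])]
        pivots.append(col)
    free = [c for c in range(k) if c not in pivots]
    if not free:
        return None
    # Set one free coefficient to 1 and solve for the pivot coefficients.
    beta = [Fraction(0)] * k
    beta[free[0]] = Fraction(1)
    for r, col in enumerate(pivots):
        beta[col] = -rows[r][free[0]]
    return beta


def reduce_to_independent(vectors, alpha):
    """Given x = sum(alpha[i] * vectors[i]) with all alpha[i] >= 0, return a
    dict {index: coefficient} with positive coefficients, the same sum x, and
    linearly independent generators."""
    coeff = {i: Fraction(a) for i, a in enumerate(alpha) if a != 0}
    while True:
        idx = sorted(coeff)
        beta = find_dependence([vectors[i] for i in idx])
        if beta is None:
            return coeff
        if all(c <= 0 for c in beta):
            beta = [-c for c in beta]  # make sure some beta_i is positive
        # Largest t keeping every alpha_i - t * beta_i nonnegative.
        t = min(coeff[i] / c for i, c in zip(idx, beta) if c > 0)
        coeff = {i: coeff[i] - t * c for i, c in zip(idx, beta)}
        coeff = {i: c for i, c in coeff.items() if c != 0}


vectors = [(1, 0), (0, 1), (1, 1)]
result = reduce_to_independent(vectors, [1, 1, 1])  # x = (2, 2)
print(result)
```

In this example the three generators are dependent, and the procedure 
rewrites x = (2, 2) using only the first two of them. 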


6.6 Farkas Lemma from Minimally Infeasible 
Systems 


Here we derive the Farkas lemma from an observation concerning minimally 
infeasible systems. A system $Ax \le b$ of $m$ inequalities is called minimally 
infeasible if the system has no solution, but every subsystem obtained by 
dropping one inequality does have a solution. 




6.6.1 Lemma. Let $Ax \le b$ be a minimally infeasible system of $m$ 
inequalities, and let $A^{(i)} x \le b^{(i)}$ be the subsystem obtained by dropping the $i$th 
inequality, $i = 1,2,\dots,m$. Then for every $i$ there exists a vector $\tilde x^{(i)}$ such 
that $A^{(i)} \tilde x^{(i)} = b^{(i)}$. 

Let us set $a_i = (a_{i1}, a_{i2},\dots,a_{in})$ and write the $i$th inequality as $a_i^T x \le b_i$. 
Here is an illustration for an example in the plane ($n = 2$) with $m = 3$ 
inequalities: 


Proof. We consider the linear program 


\[ \begin{array}{ll}
\text{minimize} & z \\
\text{subject to} & Ax \le b + z e_i,
\end{array} \tag{LP$^{(i)}$} \]

where $e_i$ is the $i$th unit vector. The idea of (LP$^{(i)}$) is to translate the 
half-space $\{x : a_i^T x \le b_i\}$ by the minimum amount necessary to achieve 
feasibility. For the example illustrated above and $i = 3$, this results in the 
following picture: 

To show formally that (LP$^{(i)}$) has an optimal solution, we first argue that 
it has a feasible solution. Indeed, by the assumption, the system $A^{(i)} x \le b^{(i)}$ 
has at least one solution. Let us fix an arbitrary solution of this system and 
denote it by $\bar x$. We put $\bar z = a_i^T \bar x - b_i$, and we note that the vector $(\bar x, \bar z)$ is 
a feasible solution of the linear program (LP$^{(i)}$). 

Next, we note that (LP$^{(i)}$) is also bounded, since $Ax \le b$ has no solution. 
Therefore, the linear program has an optimal solution $(\tilde x^{(i)}, \tilde z^{(i)})$ with 
$\tilde z^{(i)} > 0$ by Theorem 4.2.3. 

We claim that the just defined $\tilde x^{(i)}$ satisfies $A^{(i)} \tilde x^{(i)} = b^{(i)}$. We already 
know that $A^{(i)} \tilde x^{(i)} \le b^{(i)}$. Let us suppose for contradiction that 
$a_j^T \tilde x^{(i)} = b_j - \varepsilon$ for some $j \ne i$ and $\varepsilon > 0$. We will show that then 
$(\tilde x^{(i)}, \tilde z^{(i)})$ cannot be optimal for (LP$^{(i)}$). To this end, let us consider an 
optimal solution $(\tilde x^{(j)}, \tilde z^{(j)})$ of (LP$^{(j)}$). The idea is that by moving the point 
$\tilde x^{(i)}$ slightly towards $\tilde x^{(j)}$, we remain feasible for (LP$^{(i)}$), but we improve 
the objective function of (LP$^{(i)}$). More formally, for a real number $t \ge 0$, we 
define $x(t) = (1 - t)\tilde x^{(i)} + t\tilde x^{(j)}$. It follows that 

\[ \begin{aligned}
a_j^T x(t) &\le b_j - (1 - t)\varepsilon + t\tilde z^{(j)}, \\
a_i^T x(t) &\le b_i + (1 - t)\tilde z^{(i)}, \\
a_k^T x(t) &\le b_k \quad \text{for all } k \ne i, j.
\end{aligned} \]

Thus for $t$ sufficiently small, namely, for $0 < t \le (1 - t)\varepsilon / \tilde z^{(j)}$, the pair 
$(x(t), (1 - t)\tilde z^{(i)})$ is a feasible solution of (LP$^{(i)}$) with objective function 
strictly smaller than $\tilde z^{(i)}$, contradicting the assumed optimality of 
$(\tilde x^{(i)}, \tilde z^{(i)})$. Thus, $A^{(i)} \tilde x^{(i)} = b^{(i)}$ and the lemma is proved. 


We need another lemma that proves an “easy” variant of the Farkas 
lemma, concerned with arbitrary solutions of systems of equalities. 


This lemma establishes the implication in the bottom right corner 
of the table of Farkas lemma variants on page 92. 


6.6.2 Lemma. The system $Ax = b$ has a solution if and only if every 
$y \in \mathbb{R}^m$ with $y^T A = 0^T$ also satisfies $y^T b = 0$. 


Proof. One direction is easy. If $Ax = b$ has some solution $\bar x$, and if 
$y^T A = 0^T$, then $0 = 0^T \bar x = y^T A \bar x = y^T b$. 

If $Ax = b$ has no solution, we need to find a vector $y$ such that $y^T A = 0^T$ 
and $y^T b \ne 0$. Let us define $r = \operatorname{rank}(A)$ and consider the $m \times (n+1)$ matrix 
$(A \mid b)$. This matrix has rank $r + 1$ since the last column is not a linear 
combination of the first $n$ columns. For the very same reason, the matrix 

\[ \begin{pmatrix} A & b \\ 0^T & -1 \end{pmatrix} \]

has rank $r + 1$. This shows that the row vector $(0^T \mid -1)$ is a linear combination 
of rows of $(A \mid b)$, and the coefficients of this linear combination define a vector 
$y \in \mathbb{R}^m$ with $y^T A = 0^T$ and $y^T b = -1$, as desired. 

Now we proceed to the proof of the Farkas lemma. The variant that 
results most naturally is the one with an arbitrary solution of $Ax \le b$, that 
is, Proposition 6.4.3(iii). 


Proof of Proposition 6.4.3(iii). As in Lemma 6.6.2, one direction is 
easy: If $Ax \le b$ has some solution $\tilde x$, and if $y \ge 0$, $y^T A = 0^T$, we get 
$0 = 0^T \tilde x = y^T A \tilde x \le y^T b$. The interesting case is that $Ax \le b$ has no 
solution. Our task is then to construct a vector $y \ge 0$ satisfying $y^T A = 0^T$ 
and $y^T b < 0$. 

We may assume that $Ax \le b$ is minimally infeasible, by restricting to a 
suitable subsystem: A vector $y$ for this subsystem can be extended to work 
for the original system by inserting zeros at appropriate places. 

Since $Ax \le b$ has no solution, the system $Ax = b$ has no solution either. 
By Lemma 6.6.2, there exists a vector $y \in \mathbb{R}^m$ such that $y^T A = 0^T$ and 
$y^T b \ne 0$. By possibly changing signs, we may assume that $y^T b < 0$. We will 
show that this vector also satisfies $y \ge 0$, and this will finish the proof. To 
this end, we fix $i \in \{1,2,\dots,m\}$ and consider the vector $\tilde x^{(i)}$ as in Lemma 
6.6.1 above. With the terminology of the lemma, we have $A^{(i)} \tilde x^{(i)} = b^{(i)}$, and 
using $y^T A = 0^T$, we can write 

\[ y_i \bigl(a_i^T \tilde x^{(i)} - b_i\bigr) = y^T \bigl(A \tilde x^{(i)} - b\bigr) = -y^T b > 0. \]

Since $\tilde x^{(i)}$ satisfies all inequalities except possibly the $i$th one and $Ax \le b$ 
has no solution, we have $a_i^T \tilde x^{(i)} - b_i > 0$, and hence $y_i > 0$. As $i$ was 
arbitrary, $y \ge 0$. Proposition 6.4.3(iii) is proved. 


This proof of the Farkas lemma is based on the paper 


M. Conforti, M. Di Summa, and G. Zambelli: Minimally infeasible 
set partitioning problems with balanced constraints, 
Mathematics of Operations Research, to appear. 


The proof given there is even more elementary than ours in the sense 
that it does not use linear programming. We have chosen the linear 
programming approach since we find it somewhat more transparent. 


6.7 Farkas Lemma from the Fourier-Motzkin Elimination 


When explaining the “logical view” of the Farkas lemma in Section 6.4, we 
started with a system of 3 inequalities and combined pairs of inequalities 
together, until we managed to eliminate all variables and obtained the 
obviously unsatisfiable inequality $0 \le -1$. The Fourier-Motzkin elimination is a 
systematic procedure for eliminating all variables from an arbitrary system 
$Ax \le b$ of linear inequalities. If the final inequalities with no variables hold, 
we can reconstruct a solution of the original system by tracing the 
computations backward, and if one of the final inequalities does not hold, it certifies 
that the original system has no solution. 

The Fourier-Motzkin elimination is similar in spirit to Gaussian 
elimination for systems of linear equations, and it is just as simple. As in Gaussian 
elimination, variables are removed one at a time, but there is a price to pay: 
To get rid of one variable, we typically have to introduce many new 
inequalities, so that the method becomes impractical already for moderately large 
systems. The Fourier-Motzkin elimination can be considered as a simple but 
inefficient alternative to the simplex method. For the purpose of proving 
statements about systems of inequalities, efficiency is not a concern, so it is 
the simplicity of the Fourier-Motzkin elimination that makes it a very handy 
tool. 

As an example, let us consider the following system of 5 inequalities in 
3 variables: 


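\[ \begin{array}{rcrcrcr}
2x &-& 5y &+& 4z &\le& 10 \\
3x &-& 6y &+& 3z &\le& 9 \\
5x &+& 10y &-& z &\le& 15 \\
-x &+& 5y &-& 2z &\le& -7 \\
-3x &+& 2y &+& 6z &\le& 12.
\end{array} \tag{6.7} \]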


In the first step we would like to eliminate x. For a moment let us imagine that 
y and z are some fixed real numbers, and let us ask under what conditions 
we can choose a value of x such that together with the given values y and z it 
satisfies (6.7). The first three inequalities impose an upper bound on $x$, while 
the remaining two impose a lower bound. To make this clearer, we rewrite 
the system as follows: 

\[ \begin{aligned}
x &\le 5 + \tfrac{5}{2} y - 2z \\
x &\le 3 + 2y - z \\
x &\le 3 - 2y + \tfrac{1}{5} z \\
x &\ge 7 + 5y - 2z \\
x &\ge -4 + \tfrac{2}{3} y + 2z.
\end{aligned} \]

So given $y$ and $z$, the admissible values of $x$ are exactly those in the interval 
from $\max(7 + 5y - 2z,\ -4 + \tfrac{2}{3}y + 2z)$ to 
$\min(5 + \tfrac{5}{2}y - 2z,\ 3 + 2y - z,\ 3 - 2y + \tfrac{1}{5}z)$. If 
this interval happens to be empty, there is no admissible $x$. So the inequality 

\[ \max\bigl(7 + 5y - 2z,\ -4 + \tfrac{2}{3}y + 2z\bigr)
   \le \min\bigl(5 + \tfrac{5}{2}y - 2z,\ 3 + 2y - z,\ 3 - 2y + \tfrac{1}{5}z\bigr) \tag{6.8} \]


is equivalent to the existence of $x$ that together with the considered $y$ and $z$ 
solves (6.7). The key observation in the Fourier-Motzkin elimination is that 
(6.8) can be rewritten as a system of linear inequalities in the variables $y$ 
and $z$. The inequalities simply say that each of the lower bounds is less than 
or equal to each of the upper bounds: 

\[ \begin{aligned}
7 + 5y - 2z &\le 5 + \tfrac{5}{2}y - 2z \\
7 + 5y - 2z &\le 3 + 2y - z \\
7 + 5y - 2z &\le 3 - 2y + \tfrac{1}{5}z \\
-4 + \tfrac{2}{3}y + 2z &\le 5 + \tfrac{5}{2}y - 2z \\
-4 + \tfrac{2}{3}y + 2z &\le 3 + 2y - z \\
-4 + \tfrac{2}{3}y + 2z &\le 3 - 2y + \tfrac{1}{5}z.
\end{aligned} \]

If we rewrite this system in the usual form $Ax \le b$, we arrive at 

\[ \begin{array}{rcrcr}
\tfrac{5}{2}y & & &\le& -2 \\
3y &-& z &\le& -4 \\
7y &-& \tfrac{11}{5}z &\le& -4 \\
-\tfrac{11}{6}y &+& 4z &\le& 9 \\
-\tfrac{4}{3}y &+& 3z &\le& 7 \\
\tfrac{8}{3}y &+& \tfrac{9}{5}z &\le& 7.
\end{array} \tag{6.9} \]

This system has a solution exactly if the original system (6.7) has one, but it 
has one variable fewer. The reader is invited to continue with this example, 
eliminating $y$ and then $z$. We note that (6.9) gives 4 upper bounds for $y$ and 
2 lower bounds, and hence we obtain 8 inequalities after eliminating $y$. 

For larger systems the number of inequalities generated by the 
Fourier-Motzkin elimination tends to explode. This wasn’t so apparent for our small 
example, but if we have $m$ inequalities and, say, half of them impose upper 
bounds on the first variable and half impose lower bounds, then we get about 
$m^2/4$ inequalities after eliminating the first variable, about $(m^2/4)^2/4 = m^4/64$ 
after eliminating the second variable (again, provided that about half of the 
inequalities give upper bounds for the second variable and half lower bounds), etc. 
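
To get a feeling for this growth, the following Python sketch iterates the 
scenario just described. It assumes, for illustration only, that at every 
step the current inequalities split evenly into upper and lower bounds 
and that no inequality has a zero coefficient at the eliminated variable; 
the function name is ours. 

```python
def worst_case_counts(m, steps):
    """Numbers of inequalities before and after each elimination step when
    the inequalities always split evenly into upper and lower bounds."""
    counts = [m]
    for _ in range(steps):
        upper = m // 2
        lower = m - upper
        m = upper * lower  # one new inequality per (upper, lower) pair
        counts.append(m)
    return counts


print(worst_case_counts(16, 3))  # [16, 64, 1024, 262144]
```

Starting from only 16 inequalities, three elimination steps already lead 
to more than a quarter of a million inequalities in this scenario. 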

Now we formulate the procedure in general. 


Claim. Let $Ax \le b$ be a system with $n \ge 1$ variables and $m$ inequalities. 
There is a system $A'x' \le b'$ with $n - 1$ variables and at most $\max(m, m^2/4)$ 
inequalities, with the following properties: 


(i) $Ax \le b$ has a solution if and only if $A'x' \le b'$ has a solution, and 
(ii) each inequality of $A'x' \le b'$ is a positive linear combination of some 
inequalities from $Ax \le b$. 


Proof. We classify the inequalities into three groups, depending on 
the coefficient of $x_1$. We call the $i$th inequality of $Ax \le b$ a ceiling if 
$a_{i1} > 0$, and we call it a floor if $a_{i1} < 0$. Otherwise (if $a_{i1} = 0$), it is 
a level. Let $C, F, L \subseteq \{1,\dots,m\}$ collect the indices of ceilings, floors, 
and levels. We may assume that 

\[ a_{i1} = \begin{cases} 1 & \text{if } i \in C \\ -1 & \text{if } i \in F \\ 0 & \text{if } i \in L. \end{cases} \tag{6.10} \]

This situation can be reached by multiplying each inequality in $Ax \le b$ 
by a suitable positive number, which does not change the set of 
solutions. 

Now we can eliminate $x_1$ between all pairs of ceilings and floors, 
by simply adding up the two inequalities for each pair. 

If $x'$ is the (possibly empty) vector $(x_2,\dots,x_n)$, and $a'_i$ is the 
(possibly empty) vector $(a_{i2},\dots,a_{in})$, then the following inequalities 
are implied by $Ax \le b$: 

\[ {a'_j}^T x' + {a'_k}^T x' \le b_j + b_k, \quad j \in C,\ k \in F. \tag{6.11} \]

The level inequalities of $Ax \le b$ can be rewritten as 

\[ {a'_\ell}^T x' \le b_\ell, \quad \ell \in L. \tag{6.12} \]

So if $Ax \le b$ has a solution, then the system of $|C| \cdot |F| + |L|$ 
inequalities in $n - 1$ variables given by (6.11) and (6.12) has a 
solution as well. Conversely, if the latter system has a solution 
$\tilde x' = (\tilde x_2,\dots,\tilde x_n)$, we can determine a suitable value $\tilde x_1$ such that 
the vector $(\tilde x_1, \tilde x_2,\dots,\tilde x_n)$ solves $Ax \le b$. To find $\tilde x_1$, we first 
observe that (6.11) is equivalent to 

\[ {a'_k}^T \tilde x' - b_k \le b_j - {a'_j}^T \tilde x', \quad j \in C,\ k \in F. \]

This in particular implies 

\[ \max_{k \in F} \bigl({a'_k}^T \tilde x' - b_k\bigr) \le \min_{j \in C} \bigl(b_j - {a'_j}^T \tilde x'\bigr). \]

We let $\tilde x_1$ be any value between these bounds. It follows that 

\[ \begin{aligned}
\tilde x_1 + {a'_j}^T \tilde x' &\le b_j, \quad j \in C, \\
-\tilde x_1 + {a'_k}^T \tilde x' &\le b_k, \quad k \in F.
\end{aligned} \]

By our assumption (6.10), we have a feasible solution of the original 
system $Ax \le b$. We note that this argument also works for $C = \emptyset$ 
or $F = \emptyset$, with the usual convention that $\max_{t \in \emptyset} f(t) = -\infty$ and 
$\min_{t \in \emptyset} f(t) = \infty$. 
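
The elimination step from this proof can be sketched in a few lines of 
Python. The representation of an inequality as a list of coefficients 
followed by the right-hand side, as well as the function name, are our 
own choices for this illustration; exact rational arithmetic is used. 

```python
from fractions import Fraction


def eliminate_first_variable(rows):
    """One Fourier-Motzkin step.

    Each row [a_1, ..., a_n, b] encodes a_1*x_1 + ... + a_n*x_n <= b.
    The result is a list of rows [a'_2, ..., a'_n, b'] describing a system
    in x_2, ..., x_n that has a solution exactly if the input system has one.
    """
    ceilings, floors, levels = [], [], []
    for row in rows:
        row = [Fraction(v) for v in row]
        lead = row[0]
        if lead > 0:
            # Scale so that the coefficient of x_1 becomes 1, then drop it.
            ceilings.append([v / lead for v in row[1:]])
        elif lead < 0:
            # Scale so that the coefficient of x_1 becomes -1, then drop it.
            floors.append([v / -lead for v in row[1:]])
        else:
            levels.append(row[1:])
    # Adding a ceiling and a floor cancels x_1.
    combined = [[c + f for c, f in zip(ceil, floor)]
                for ceil in ceilings for floor in floors]
    return combined + levels


# The system (6.7), with the variables ordered as x, y, z.
system = [
    [2, -5, 4, 10],
    [3, -6, 3, 9],
    [5, 10, -1, 15],
    [-1, 5, -2, -7],
    [-3, 2, 6, 12],
]
reduced = eliminate_first_variable(system)
for row in reduced:
    print([str(v) for v in row])
```

Applied to (6.7), the sketch produces the six inequalities of (6.9), 
possibly in a different order. 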

Now we can prove the Farkas lemma. The variant that results most 
naturally from the Fourier-Motzkin elimination is (as in Section 6.6) the one with 
an arbitrary solution of $Ax \le b$, that is, Proposition 6.4.3(iii). 


Proof of Proposition 6.4.3(iii). One direction is easy. If $Ax \le b$ has some 
solution $\tilde x$, and $y \ge 0$ satisfies $y^T A = 0^T$, we get $0 = 0^T \tilde x = y^T A \tilde x \le y^T b$. 
If $Ax \le b$ has no solution, then our task is to construct a vector $y$ satisfying 

\[ y \ge 0, \quad y^T A = 0^T, \quad \text{and} \quad y^T b < 0. \tag{6.13} \]

To find such a witness of infeasibility, we use induction on the number of 
variables. Let us first consider the base case in which the system $Ax \le b$ has 
no variables, meaning that it is of the form $0 \le b$ with $b_i < 0$ for some $i$. We 
set $y = e_i$ (the $i$th unit vector), and this clearly satisfies the requirements for 
$y$ being a witness of infeasibility (the condition $y^T A = 0^T$ is vacuous, since 
$A$ has no column). 

If $Ax \le b$ has at least one variable, we perform a step of the 
Fourier-Motzkin elimination. This yields an infeasible system $A'x' \le b'$, consisting of 
the inequalities (6.11) and (6.12). Because the latter system has one variable 
fewer, we inductively find a witness of infeasibility $y'$ for it. We recall that all 
inequalities of $A'x' \le b'$ are positive linear combinations of original 
inequalities; equivalently, there is an $m' \times m$ matrix $M$, where $m'$ is the number of 
inequalities of $A'x' \le b'$, with all entries nonnegative and 

\[ (0 \mid A') = MA, \qquad b' = Mb. \]

We claim that $y = M^T y'$ is a witness of infeasibility for the original system 
$Ax \le b$. Indeed, we have $y^T A = y'^T M A = y'^T (0 \mid A') = 0^T$ and 
$y^T b = y'^T M b = y'^T b' < 0$, since $y'$ is a witness of infeasibility for $A'x' \le b'$. 
The condition $y \ge 0$ follows from $y' \ge 0$ by the nonnegativity of $M$. 
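
This proof is constructive, and the following Python sketch illustrates it. 
Instead of the recursion and the matrices M, it carries along with every 
derived inequality the nonnegative multipliers that express it in terms of 
the original inequalities; this bookkeeping and the function name are our 
own choices and only one of several possible ways to organize the 
computation. 

```python
from fractions import Fraction


def farkas_certificate(A, b):
    """Return y >= 0 with y^T A = 0^T and y^T b < 0 if Ax <= b has no
    solution, and None otherwise (A is a list of rows, b a list)."""
    m = len(A)
    n = len(A[0]) if m else 0
    # Each item: (coefficients, right-hand side, multipliers of original rows).
    system = []
    for i in range(m):
        mult = [Fraction(0)] * m
        mult[i] = Fraction(1)
        system.append(([Fraction(v) for v in A[i]], Fraction(b[i]), mult))
    for _ in range(n):  # eliminate the variables one at a time
        ceilings, floors, new_system = [], [], []
        for coeffs, rhs, mult in system:
            lead = coeffs[0]
            if lead == 0:
                new_system.append((coeffs[1:], rhs, mult))  # a level
                continue
            s = abs(lead)  # positive scaling factor
            scaled = ([c / s for c in coeffs[1:]], rhs / s, [v / s for v in mult])
            (ceilings if lead > 0 else floors).append(scaled)
        for c_co, c_rhs, c_mu in ceilings:
            for f_co, f_rhs, f_mu in floors:
                new_system.append((
                    [p + q for p, q in zip(c_co, f_co)],
                    c_rhs + f_rhs,
                    [p + q for p, q in zip(c_mu, f_mu)],
                ))
        system = new_system
    # No variables are left: every inequality reads 0 <= rhs.
    for _, rhs, mult in system:
        if rhs < 0:
            return mult
    return None


# x + y <= 1, x >= 1, y >= 1 has no solution.
A = [[1, 1], [-1, 0], [0, -1]]
b = [1, -1, -1]
y = farkas_certificate(A, b)
print(y)
```

For this small system the sketch returns y = (1, 1, 1): adding up the three 
inequalities gives the contradiction 0 ≤ -1. 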


7. Not Only the Simplex Method 


Tens of different algorithms have been suggested for linear programming over 
the years. Most of them didn’t work very well, and only very few have turned 
out as serious competitors to the simplex method, the historically first al- 
gorithm. But at least two methods raised great excitement at the time of 
discovery and they are surely worth mentioning. 

The first of them, the ellipsoid method, cannot compete with the simplex 
method in practice, but it had immense theoretical significance. It is the first 
linear programming algorithm for which it was proved that it always runs in 
polynomial time (which is not known about the simplex method up to the 
present, and for many pivot rules it is not even true). 

The second is the interior point method, or rather, we should say interior 
point methods, since it is an entire group of algorithms. For some of them a 
polynomial bound on the running time has also been proved, but moreover, 
these algorithms successfully compete with the simplex method in practice. 
It seems that for some types of linear programs the simplex method is better, 
while for others interior point methods are the winners. 


Let us remark that several other algorithms, closely related to 
the simplex method, are used for linear programming as well. The 
dual simplex method can roughly be described as the simplex method 
applied to the dual linear program. But details of the implementa- 
tion, which are crucial for the speed of the algorithm in practice, are 
somewhat different. The dual simplex method is particularly suitable 
for linear programs that in equational form have $n - m$ significantly 
smaller than $m$. 

The primal-dual method goes through a sequence of feasible so- 
lutions of the dual linear program. To get from one such solution to 
the next, it does not perform a pivot step, but it solves an auxiliary 
problem that may be derived from the primal linear program or by 
other means. This greater freedom can be useful, for instance, in ap- 
proximation algorithms for combinatorial optimization problems. 

A little more about the dual simplex method and the primal-dual 
method can be found in the glossary. 

7.1 The Ellipsoid Method 


The ellipsoid method was invented in 1970 by Shor, Judin, and Nemirovski 
as an algorithm for certain nonlinear optimization problems. In 1979 Leonid 
Khachyian outlined, in a short note, how linear programs can be solved by 
this method in provably polynomial time. The world press made a 
sensation out of this since the journalists contorted the result and presented it as 
an unprecedented breakthrough in practical computational methods (giving 
the Soviets a technological edge over the West...).¹ However, the ellipsoid 
method has never been interesting for the practice of linear programming. 
Khachyian’s discovery was indeed extremely significant, but for the theory of 
computational complexity. It solved an open problem that many people had 
attacked in vain for many years. The solution was conceptually utterly dif- 
ferent from previous approaches, which were mostly variations of the simplex 
method. 


Input size and polynomial algorithms. In order to describe what we 
mean by a polynomial algorithm for linear programming, we have to define 
the input size of a linear program. Roughly speaking, it is the total number 
of bits needed for writing down the input to a linear programming algorithm. 
First we define the bit size of an integer $i$ as 

\[ \langle i \rangle = \lceil \log_2(|i| + 1) \rceil + 1, \]

which is the number of bits of $i$ written in binary, including one bit for the 
sign. For a rational number $r$, i.e., a fraction $r = p/q$, the bit size is defined as 
$\langle r \rangle = \langle p \rangle + \langle q \rangle$. For an $n$-component rational vector $v$ we put 
$\langle v \rangle = \sum_{i=1}^{n} \langle v_i \rangle$, and similarly, $\langle A \rangle = \sum_{i=1}^{m} \sum_{j=1}^{n} \langle a_{ij} \rangle$ for a rational 
$m \times n$ matrix $A$. If we consider a linear program $L$, say in the form 

\[ \text{maximize } c^T x \text{ subject to } Ax \le b, \]

and if we restrict ourselves to the case of $A$, $b$, and $c$ rational (which is a 
reasonable assumption from a computational perspective), then the bit size 
of $L$ is $\langle L \rangle = \langle A \rangle + \langle b \rangle + \langle c \rangle$. 


¹ The following quotation from 

E. L. Lawler: The Great Mathematical Sputnik of 1979, Math. Intelligencer 
2(1980) 191-198, 

which is a remarkable article about the history of Khachyian’s result, is not only 
of historical interest: 

The Times story appears to have been based on certain unshakable 
preconceptions of its writer, Malcolm W. Browne. Browne called George Dantzig, of 
Stanford University, a great pioneering authority on linear programming, and 
tried to force him into various admissions. Dantzig’s version of the interview 
bears repeating: 

“What about the traveling salesman problem?” asked Browne. “If there 
is a connection, I don’t know what it is,” said Dantzig. (“The Russian 
discovery proposed an approach for [solving] a class of problems related to 
the ‘Traveling Salesman Problem,’” reported Browne.) “What about 
cryptography?” asked Browne. “If there is a connection, I don’t know what it 
is,” said Dantzig. (“The theory of codes could eventually be affected,” 
reported Browne.) “Is the Russian method practical?” asked Browne. “No,” 
said Dantzig. (“Mathematicians describe the discovery ...as a method by 
which computers can find solutions to a class of very hard problems that 
has hitherto been attacked on a hit-or-miss basis,” reported Browne.) 
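
The bit size is easy to compute. The following Python sketch is an 
illustration with hypothetical function names; it relies on the fact that 
for an integer i the quantity ⌈log₂(|i| + 1)⌉ equals the number of binary 
digits of |i|, and it assumes that every rational number is given in 
lowest terms, with integers stored as fractions with denominator 1 
(Python's Fraction type normalizes in this way). 

```python
from fractions import Fraction


def bit_size_int(i):
    """Bit size of an integer: binary digits of |i| plus one bit for the sign."""
    return abs(i).bit_length() + 1


def bit_size_rational(r):
    """Bit size of a rational p/q in lowest terms: size of p plus size of q."""
    r = Fraction(r)
    return bit_size_int(r.numerator) + bit_size_int(r.denominator)


def bit_size_vector(v):
    """Sum of the bit sizes of the components of v."""
    return sum(bit_size_rational(x) for x in v)


def bit_size_matrix(A):
    """Sum of the bit sizes of all entries of A (a list of rows)."""
    return sum(bit_size_vector(row) for row in A)


def bit_size_lp(A, b, c):
    """Bit size of the linear program: maximize c^T x subject to Ax <= b."""
    return bit_size_matrix(A) + bit_size_vector(b) + bit_size_vector(c)


print(bit_size_int(5), bit_size_int(-5), bit_size_int(0))
print(bit_size_rational(Fraction(3, 4)))
print(bit_size_lp([[1, 2], [3, 4]], [5, 6], [1, 1]))
```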

We say that an algorithm is a polynomial algorithm for linear 
programming if a polynomial $p(x)$ exists such that for every linear program 
$L$ with rational $A$, $b$, and $c$ the algorithm finds a correct solution in at 
most $p(\langle L \rangle)$ steps. The steps are counted in some of the usual models of 
computation, for example, as steps of a Turing machine (usually the chosen 
computational model is not crucial; whatever is polynomial in one model is 
also polynomial in other reasonable models). We stress right away that a 
single arithmetic operation is not counted as a single step here! We count as 
steps operations with single bits, and hence, addition of two k-bit integers 
requires at least k steps. 


Let us digress briefly from linear programming and let us consider 
Gaussian elimination, a well-known algorithm for solving systems of 
linear equations. For a system $Ax = b$, where (for simplicity) $A$ is an 
$n \times n$ matrix and both $A$ and $b$ are rational, we naturally define the 
input size as $\langle A \rangle + \langle b \rangle$. Is Gaussian elimination a polynomial algorithm? 
This is a tricky question! Although this algorithm needs only order of 
$n^3$ arithmetic operations, the catch is that too large intermediate 
values could come up during the computation, even if all entries in $A$ 
and in $b$ are small integers. If, for example, integers with as many as 
$2^n$ bits ensued, which can indeed happen in a naive implementation 
of Gaussian elimination, the computation would need exponentially 
many steps, although it would involve only $n^3$ arithmetic operations. 
(All of this concerns exact computations, while many implementations 
use floating-point arithmetic and hence the numbers are continually 
rounded. But then there is no guarantee that the results are correct.) 
We do not want to scare the reader needlessly: It is known how Gauss- 
ian elimination can be implemented in polynomial time. We want only 
to point out that this is not self-evident (and not too simple, either), 
and call attention to one kind of trouble that may develop in attempts 
at proving polynomiality. 
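
The gap between the number of arithmetic operations and the number of 
bit operations can be seen on an example much simpler than Gaussian 
elimination. The following Python sketch is only an analogy, not an 
implementation of Gaussian elimination: it squares a number n times, which 
takes n multiplications, and reports the bit length of the result. 

```python
def repeated_squaring_bits(n):
    """Square the number 2 exactly n times and return the number of binary
    digits of the result, which is 2 raised to the power 2^n."""
    x = 2
    for _ in range(n):
        x = x * x  # one arithmetic operation, but the bit length doubles
    return x.bit_length()


for n in (1, 5, 10):
    print(n, repeated_squaring_bits(n))
```

After only 10 multiplications the result has 1025 binary digits, so the 
number of bits grows exponentially with the number of arithmetic 
operations. 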


The ellipsoid method, as well as some of the interior point methods, are 
polynomial, while the simplex method with Bland’s rule (and with many 
other pivot rules too) is not polynomial.² 


² The nonpolynomiality is proved by means of the Klee-Minty cube; see 
Section 5.9. One has to check that an n-dimensional Klee-Minty cube can be 
represented by input of size polynomial in n. 

Strongly polynomial algorithms. For algorithms whose input is 
described by a sequence of integers or rationals, such as algorithms for 
linear programming, the number of arithmetic operations (addition, 
subtraction, multiplication, division, exponentiation) is also consid- 
ered, together with the number of bit operations. This often gives a 
more realistic picture of the running time, because contemporary com- 
puters usually execute an arithmetic operation as an elementary step, 
provided that the operands are not too large. 

A suitable implementation of Gaussian elimination is, on the one 
hand, a polynomial algorithm in the sense discussed above, and on 
the other hand, the number of arithmetic operations is bounded by a 
polynomial, namely by the polynomial $Cn^3$ for a suitable constant $C$, 
where $n$ is the number of equations in the system and also the number 
of variables. The number of arithmetic operations thus depends only 
on n, and it is the same for input numbers with 10 bits as for input 
numbers with a million bits. We say that Gaussian elimination is a 
strongly polynomial algorithm for solving systems of linear equa- 
tions. 

A strongly polynomial algorithm for linear programming would be 
one that, first, would be polynomial in the sense defined above, and 
second, for every linear program with $n$ variables and $m$ constraints 
it would find a solution using at most $p(m+n)$ arithmetic operations, 
where $p(x)$ is a fixed polynomial. But no strongly polynomial 
algorithm for linear programming is known, and finding one is a major 
open problem. 

The ellipsoid method is not strongly polynomial. For every natural 
number M one can find a linear program with only 2 variables and 
2 constraints for which the ellipsoid method executes at least $M$ 
arithmetic operations (the coefficients in such linear programs must have 
bit size tending to infinity as $M \to \infty$). In particular, the number of 
arithmetic operations for the ellipsoid method cannot be bounded by 
any polynomial in m+n. 


Ellipsoids. A two-dimensional ellipsoid is an ellipse plus its interior. An 
ellipsoid in general can most naturally be introduced as an affine transfor- 
mation of a ball. We let 


\[ B^n = \{x \in \mathbb{R}^n : x^T x \le 1\} \]

be the $n$-dimensional ball of unit radius centered at $0$. Then an $n$-dimensional 
ellipsoid is a set of the form 

\[ E = \{Mx + s : x \in B^n\}, \]

where $M$ is a nonsingular $n \times n$ matrix and $s \in \mathbb{R}^n$ is a vector. The mapping 
$x \mapsto Mx + s$ is a composition of a linear function and a translation; this is 
called an affine map. 

By manipulating the definition we can describe the ellipsoid by an 
inequality: 

\[ \begin{aligned}
E &= \{y \in \mathbb{R}^n : M^{-1}(y - s) \in B^n\} \\
  &= \{y \in \mathbb{R}^n : (y - s)^T (M^{-1})^T M^{-1} (y - s) \le 1\} \\
  &= \{y \in \mathbb{R}^n : (y - s)^T Q^{-1} (y - s) \le 1\},
\end{aligned} \tag{7.1} \]

where we have set $Q = MM^T$. It is well known and easy to check that such a 
$Q$ is a positive definite matrix, that is, a symmetric square matrix satisfying 
$x^T Q x > 0$ for all nonzero vectors $x$. Conversely, from matrix theory it is 
known that each positive definite matrix $Q$ can be factored as $Q = MM^T$ 
for some nonsingular square matrix $M$. Therefore, an equivalent definition is 
that an ellipsoid is a set described by (7.1) for some positive definite $Q$ and 
some $s$. 

Geometrically, $s$ is the center of the ellipsoid $E$. If $Q$ is a diagonal matrix 
and $s = 0$, then we have an ellipsoid in axial position, of the form 

$$\left\{ y \in \mathbb{R}^n : \frac{y_1^2}{q_{11}} + \frac{y_2^2}{q_{22}} + \cdots + \frac{y_n^2}{q_{nn}} \le 1 \right\}.$$

The axes of this ellipsoid are parallel to the coordinate axes. The numbers 
$\sqrt{q_{11}}, \sqrt{q_{22}}, \ldots, \sqrt{q_{nn}}$ are the lengths of the semiaxes of the ellipsoid $E$, which 
may sound familiar to those accustomed to the equation of an ellipse of the 
form $\frac{x_1^2}{a^2} + \frac{x_2^2}{b^2} = 1$. As is taught in linear algebra in connection with eigenvalues, 
each positive definite matrix $Q$ can be diagonalized by an orthogonal basis 
change. That is, there exists an orthogonal matrix $T$ such that the matrix 
$TQT^{-1}$ is diagonal, with the eigenvalues of $Q$ on the diagonal. Geometrically, 
$T$ represents a rotation of the coordinate system that brings the ellipsoid into 
axial position. 
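
The two descriptions of an ellipsoid can be compared on a small example. The following is a minimal sketch in plain Python, restricted to the plane; the helper names (`shape_matrix`, `in_ellipsoid`, and so on) and the sample matrix are illustrative assumptions, not notation from the text. It builds $Q = MM^T$ from $M$ and tests the inequality (7.1) for points of the form $Mx + s$ with $x$ on the unit circle.

```python
import math

def mat_vec(M, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def shape_matrix(M):
    """Q = M M^T for a square matrix M given as a list of rows."""
    n = len(M)
    return [[sum(M[i][k] * M[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inverse_2x2(Q):
    """Inverse of a nonsingular 2x2 matrix."""
    (a, b), (c, d) = Q
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def in_ellipsoid(Q, s, y, tol=1e-12):
    """Test (y - s)^T Q^{-1} (y - s) <= 1 for a 2x2 positive definite Q."""
    d = [y_i - s_i for y_i, s_i in zip(y, s)]
    z = mat_vec(inverse_2x2(Q), d)
    return sum(d_i * z_i for d_i, z_i in zip(d, z)) <= 1 + tol

# Axial example: M stretches the unit disk by 3 along x and by 2 along y.
M = [[3.0, 0.0], [0.0, 2.0]]
s = [1.0, -1.0]
Q = shape_matrix(M)                                   # diagonal matrix diag(9, 4)
semiaxes = [math.sqrt(Q[0][0]), math.sqrt(Q[1][1])]   # square roots of q_11, q_22

# Images M x + s of points x on the unit circle lie on the boundary of E.
boundary = []
for t in (0.0, 1.0, 2.5):
    p = mat_vec(M, [math.cos(t), math.sin(t)])
    boundary.append([p[0] + s[0], p[1] + s[1]])
```

Since $Q$ is diagonal here, the semiaxes can be read off directly; for a general $Q$ one would use the eigenvalues instead.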


A lion in the Sahara. A traditional mathematical anecdote gives directions 
for hunting a lion in the Sahara (under the assumption that there is at most 
one). We fence all of the Sahara, we divide it into two halves by another 
fence, and we detect one half that has no lion in it. Then we divide the other 
half by a fence, and we continue in this manner until the fenced piece of 
ground is so small that the lion cannot move and so it is caught, or if there 
is no lion in it, we have proved that there was none in the Sahara either. 
Although the qualities of this hunting guide can be disputed, for us it is 
essential that it gives a reasonably good description of the ellipsoid method. 
But in the real ellipsoid method we insist that the currently fenced piece is 
always an ellipsoid, even at the price that the lion can sometimes return to 
places from where it was expelled earlier; it is only guaranteed that the area 
of its territory shrinks all the time. 

The ellipsoid method doesn’t directly solve a linear program, but rather 
it seeks a solution of a system of linear inequalities $Ax \le b$. But as we know, 
this is sufficient for solving a linear program (see Section 6.1). For a simpler 
exposition we will first consider the following softened version of the problem: 


Together with the matrix $A$ and vector $b$ we are given rational numbers 
$R > \varepsilon > 0$. We assume that the set $P = \{x \in \mathbb{R}^n : Ax \le b\}$ is contained in 
the ball $B(0, R)$ centered at 0 with radius $R$. If $P$ contains a ball of radius $\varepsilon$, 
then the algorithm has to return a point $y \in P$. However, if $P$ contains no 
ball of radius $\varepsilon$, then the algorithm may return either some $y \in P$, or the 
answer NO SOLUTION. 


The ball $B(0, R)$ thus plays the role of the Sahara and we assume that 
the lion, if present, is at least $\varepsilon$ large. If there is only a smaller lion in the 
Sahara, it may escape or we may catch it; we don’t care. 

Under these assumptions, the ellipsoid method generates a sequence of 
ellipsoids $E_0, E_1, \ldots, E_t$, where $P \subseteq E_k$ for each $k$, as follows: 


1. Set $k = 0$ and $E_0 = B(0, R)$. 

2. Let the current ellipsoid $E_k$ be of the form 
$E_k = \{y \in \mathbb{R}^n : (y - s_k)^T Q_k^{-1} (y - s_k) \le 1\}$. If $s_k$ satisfies all inequalities 
of the system $Ax \le b$, return $s_k$ as a solution; stop. 

3. Otherwise, choose an inequality of the system that is violated by $s_k$. 
Let it be the $i$th inequality; so we have $a_i^T s_k > b_i$. Define $E_{k+1}$ as the 
ellipsoid of the smallest possible volume containing the “half-ellipsoid” 
$H_k = E_k \cap \{x \in \mathbb{R}^n : a_i^T x \le a_i^T s_k\}$. 

[Figure: the half-ellipsoid $H_k$ and the ellipsoid $E_{k+1}$ containing it.] 

4. If the volume of $E_{k+1}$ is smaller than the volume of a ball of radius $\varepsilon$, 
return NO SOLUTION; stop. Otherwise, increase $k$ by 1 and continue 
with Step 2. 


Let $H_k'$ denote the intersection of the ellipsoid $E_k$ with the half-space 
$\{x \in \mathbb{R}^n : a_i^T x \le b_i\}$ defined by the $i$th inequality of the system. If $P \subseteq E_k$, 
then also $P \subseteq H_k'$, and the more so $P \subseteq H_k$. Why is the smallest ellipsoid 
containing $H_k$ taken for $E_{k+1}$, instead of the smallest ellipsoid containing $H_k'$? 
Purely in order to simplify the analysis of the algorithm, since the equation 
of $E_{k+1}$ comes out less complicated this way. 


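The ellipsoid $E_{k+1}$, i.e., the ellipsoid of the smallest volume 
containing $H_k$, is always determined uniquely. For illustration we mention 
that it is given by 

$$s_{k+1} = s_k - \frac{1}{n+1} \cdot \frac{Q_k a_i}{\sqrt{a_i^T Q_k a_i}},$$

$$Q_{k+1} = \frac{n^2}{n^2-1} \left( Q_k - \frac{2}{n+1} \cdot \frac{Q_k a_i a_i^T Q_k}{a_i^T Q_k a_i} \right).$$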

We also leave without proof a fact crucial for the proof of correctness 
and efficiency of the ellipsoid method: We always have 

$$\frac{\mathrm{volume}(E_{k+1})}{\mathrm{volume}(E_k)} \le e^{-1/(2n+2)}.$$

Hence the volume of the ellipsoid $E_k$ is at least $e^{k/(2n+2)}$ times smaller 
than the volume of the initial ball $B(0, R)$. Since the volume of an 
$n$-dimensional ball is proportional to the $n$th power of the radius, for $k$ 
satisfying $R \cdot e^{-k/(n(2n+2))} < \varepsilon$ the volume of $E_k$ is smaller than that 
of a ball of radius $\varepsilon$. Such $k$ provides an upper bound of 
$\lceil n(2n+2) \ln(R/\varepsilon) \rceil$ on the maximum number of iterations. This is bounded by 
a polynomial in $n + \langle R \rangle + \langle \varepsilon \rangle$. 
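
The steps above, together with the update formulas and the iteration bound, can be sketched as follows. This is only an illustration under stated assumptions: it uses plain Python floating point, it requires $n \ge 2$, it replaces the volume test of Step 4 by the equivalent iteration bound $\lceil n(2n+2)\ln(R/\varepsilon)\rceil$, and it contains none of the rounding safeguards a real implementation needs. The function name `ellipsoid_method` and the sample triangle are hypothetical.

```python
import math

def ellipsoid_method(A, b, R, eps):
    """Softened feasibility problem for A x <= b with P inside B(0, R).

    Returns a point of P, or None for NO SOLUTION. Requires n >= 2.
    """
    n = len(A[0])
    s = [0.0] * n                                   # center s_0 of E_0 = B(0, R)
    Q = [[R * R if i == j else 0.0 for j in range(n)] for i in range(n)]
    # Iteration bound from the volume argument, used instead of Step 4.
    max_iter = math.ceil(n * (2 * n + 2) * math.log(R / eps))
    for _ in range(max_iter):
        # Step 2: look for an inequality violated by the current center.
        violated = next((i for i, row in enumerate(A)
                         if sum(a * x for a, x in zip(row, s)) > b[i]), None)
        if violated is None:
            return s                                # the center satisfies A x <= b
        # Step 3: smallest ellipsoid containing the half-ellipsoid H_k.
        a = A[violated]
        Qa = [sum(Q[i][j] * a[j] for j in range(n)) for i in range(n)]
        aQa = sum(a[i] * Qa[i] for i in range(n))
        root = math.sqrt(aQa)
        s = [s[i] - Qa[i] / ((n + 1) * root) for i in range(n)]
        factor = n * n / (n * n - 1.0)
        Q = [[factor * (Q[i][j] - 2.0 / (n + 1) * Qa[i] * Qa[j] / aQa)
              for j in range(n)] for i in range(n)]
    return None

# Triangle x >= 1, y >= 1, x + y <= 3, written as A x <= b.
A = [[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]
b = [-1.0, -1.0, 3.0]
point = ellipsoid_method(A, b, R=10.0, eps=0.1)
```

The triangle contains a disk of radius larger than 0.1, so by the volume argument some center must become feasible before the iteration bound is exhausted.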

So much for the simple and beautiful idea of the ellipsoid method; 
now we are coming to manageable but unpleasant complications. First 
of all, we cannot compute the ellipsoid $E_{k+1}$ exactly, at least not in 
rational arithmetic, since the defining formulas contain square roots. 
To get around this, $E_{k+1}$ is computed only approximately with a 
suitable precision. But one has to be careful so that $P$ is still guaranteed 
to be contained in $E_{k+1}$, and thus the approximate $E_{k+1}$ has to be 
expanded slightly. 

Another trouble arises, for example, when the same inequality is 
used for cutting the current ellipsoid in many iterations in a row. Then 
the ellipsoids may become too long (needle-like), and they have to be 
shortened artificially. 

Yet another problem is that we don’t really want to solve the 
softened problem with $R$ and $\varepsilon$, but an arbitrary system of linear 
inequalities without any restrictions. Here the bound on the bit size of the 
entries of $A$ and $b$ comes into play, through the following facts: 


(E1) (If a solution exists, then there is a not too large solution.) Let 
$\varphi = \langle A \rangle + \langle b \rangle$ denote the input size for the system $Ax \le b$. Then 
this system has a solution if and only if the system 

$$Ax \le b, \qquad -K \le x_i \le K \quad (i = 1, 2, \ldots, n)$$

has a solution, where $K = 2^{\varphi}$. Clearly, all solutions of the latter 
system are contained in the ball $B(0, R)$, where $R = K\sqrt{n}$. 

(E2) (If a solution exists, then the solution set of a slightly relaxed 
system contains a small ball.) Let us put $\eta = 2^{-5\varphi}$, $\varepsilon = 2^{-6\varphi}$, and 
let $\boldsymbol{\eta}$ be the $m$-component vector with all components equal to $\eta$. 
Then the system $Ax \le b$ has a solution if and only if the system 
$Ax \le b + \boldsymbol{\eta}$ has a solution, and in such case the solution set of 
the latter system contains a ball of radius $\varepsilon$. 


It is not hard to see how these facts can be used for solving an arbitrary 
system $Ax \le b$ by the ellipsoid method. Instead of this system we solve 
the softened problem by the ellipsoid method, but for a new system 
$Ax \le b + \boldsymbol{\eta}$, $-K - \eta \le x_1 \le K + \eta$, $-K - \eta \le x_2 \le K + \eta$, ..., 
$-K - \eta \le x_n \le K + \eta$, where $K$, $R$, $\varepsilon$, and $\eta$ are chosen suitably (first we add 
the constraints $-K \le x_i \le K$ as in (E1), and then we apply (E2) to 
the resulting system). It is important that the bit size of $R$ and $\varepsilon$, as 
well as the input size of the new system, are bounded by a polynomial 
function of $\varphi$. Thus the ellipsoid method runs in polynomial time, and 
it always finds a solution of $Ax \le b$ if it exists. 

We will not prove facts (E1) and (E2) here, but we sketch the basic 
ideas. For (E1) we first discuss the case $n = 2$ (in the plane). Let us 
consider a system of $m$ inequalities 

$$a_{i1} x + a_{i2} y \le b_i, \qquad i = 1, 2, \ldots, m.$$

Let $\ell_i$ be the line $\{(x, y) \in \mathbb{R}^2 : a_{i1} x + a_{i2} y = b_i\}$. It is easy to calculate 
that the intersection of $\ell_i$ and $\ell_j$, if it exists, has coordinates 

$$\left( \frac{a_{i2} b_j - a_{j2} b_i}{a_{i2} a_{j1} - a_{i1} a_{j2}},\ \frac{a_{j1} b_i - a_{i1} b_j}{a_{i2} a_{j1} - a_{i1} a_{j2}} \right).$$

If, for example, all $a_{ij}$ and $b_i$ are integers with absolute value at most 
1000, then the coordinates of all intersections are fractions with 
numerators and denominators bounded by $2 \cdot 10^6$ in absolute value. Thus, 
if the solution set of the considered system of inequalities has at least 
one vertex, such a vertex has to lie in the square $[-2 \cdot 10^6, 2 \cdot 10^6]^2$. 
If the solution set has no vertex and it is nonempty, it can be shown 
that it has to contain one of the lines $\ell_i$, and that each $\ell_i$ intersects 
the just-mentioned square. Fact (E1) can be verified along these lines 
for the considered system with two variables. For a general system in 
dimension $n$ the idea is the same, and Cramer’s rule and a bound on 
the magnitude of the determinant of a matrix are used for estimating 
the coordinates of vertices of the solution set. 
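
The planar estimate can be checked experimentally with exact rational arithmetic. The sketch below is an illustration only: the function name `intersection`, the random seed, and the number of lines are arbitrary choices, and the lines are stored as hypothetical triples $(a_{i1}, a_{i2}, b_i)$. It evaluates the intersection formula above for random integer lines with coefficients of absolute value at most 1000.

```python
from fractions import Fraction
from itertools import combinations
import random

def intersection(line_i, line_j):
    """Intersection of a_i1 x + a_i2 y = b_i and a_j1 x + a_j2 y = b_j, or None."""
    ai1, ai2, bi = line_i
    aj1, aj2, bj = line_j
    det = ai2 * aj1 - ai1 * aj2
    if det == 0:
        return None                      # no unique intersection point
    return (Fraction(ai2 * bj - aj2 * bi, det),
            Fraction(aj1 * bi - ai1 * bj, det))

random.seed(0)
lines = [tuple(random.randint(-1000, 1000) for _ in range(3)) for _ in range(30)]
points = [intersection(li, lj) for li, lj in combinations(lines, 2)]
points = [p for p in points if p is not None]

bound = 2 * 10**6
within_bound = all(abs(c.numerator) <= bound and abs(c.denominator) <= bound
                   for p in points for c in p)
```

Each numerator and each denominator is a difference of two products of integers of absolute value at most 1000, which is where the bound $2 \cdot 10^6$ comes from; reducing the fractions can only make them smaller.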

Fact (E2) requires more work, but the idea is similar. Each solution 
$x$ of the original system $Ax \le b$ also satisfies the modified system 
$Ax \le b + \boldsymbol{\eta}$, and all points of the ball $B(x, \varepsilon)$ satisfy it as well, because 
changing $x$ by $\varepsilon$ cannot change any coordinate of $Ax$ by more than $\eta$. 

If $Ax \le b$ has no solution, then by a suitable variant of the Farkas 
lemma, namely, Proposition 6.4.3(iii), there exists a nonnegative 
$y \in \mathbb{R}^m$ such that $y^T A = 0^T$ and $y^T b < 0$, and by normalizing $y$ we may 
assume $y^T b = -1$. By Cramer’s rule again it is shown that there also 
exists a $y$ with not too large components, and such $y$ then witnesses 
unsolvability for the system $Ax \le b + \boldsymbol{\eta}$ as well. 

Here we finish the outline of the ellipsoid method. If some parts 
were too incomplete and hazy for the reader, we can only recommend 
a more extensive treatment, for instance in the excellent book 


M. Grötschel, L. Lovász, L. Schrijver: Geometric Algorithms 
and Combinatorial Optimization, 2nd edition, Springer, 
Heidelberg 1994. 


(We have taken the Sahara metaphor from there, among others.) 


Why ellipsoids? They are used in the ellipsoid method since they 
constitute probably the simplest class of n-dimensional convex sets 
that is closed under nonsingular affine maps. Popularly speaking, this 
class is rich enough to approximate all convex polyhedra including flat 
ones and needle-like ones. If desired, ellipsoids can be replaced by sim- 
plices, for example, but the formulas in the algorithm and its analysis 
become considerably more unpleasant than those for ellipsoids. 


The ellipsoid method need not know all of the linear 
program. The system of inequalities $Ax \le b$ can also be given by means 
of a separation oracle. This is an algorithm (black box) that accepts 
a point $s \in \mathbb{R}^n$ as input, and if $s$ is a solution of the system, it returns 
the answer YES, while if $s$ is not a solution, it returns one (arbitrary) 
inequality of the system that is violated by $s$. (Such an inequality 
separates $s$ from the solution set, and hence the name separation oracle.) 
The ellipsoid method calls the separation oracle with the centers $s_k$ 
of the generated ellipsoids, and it always uses the violated inequality 
returned by the oracle for determining the next ellipsoid. 

We talk about this since a separation oracle can be implemented 
efficiently for some interesting optimization problems even when the 
full system has exponentially many inequalities or even infinitely many 
(so far we haven’t considered infinite systems at all). 

Probably the most important example of a situation in which an 
infinite system of linear inequalities can be solved by the ellipsoid 
method is semidefinite programming. In a semidefinite program 
we consider not an unknown vector $x$, but rather an unknown square 
matrix $X = (x_{ij})_{i,j=1}^{n}$. We optimize a linear function of the variables 
$x_{ij}$. The constraints are linear inequalities and equations for the $x_{ij}$, 
plus the requirement that the matrix $X$ has to be positive semidefinite. 
The last constraint distinguishes semidefinite programming from 
linear programming. It can be expressed by a system of infinitely many 
linear inequalities, namely, $a^T X a \ge 0$ for each $a \in \mathbb{R}^n$. A 
separation oracle for this system can be constructed based on algorithms 
for computing eigenvalues and eigenvectors. The ellipsoid method can 
then approximate the optimum with a prescribed precision in 
polynomial time. (In reality, though, things are not quite as simple as it 
might seem from our mini-description.) 
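
To make the idea of a separation oracle for the infinite system $a^T X a \ge 0$ concrete, here is a minimal sketch restricted to symmetric $2 \times 2$ matrices, where the smallest eigenvalue and its eigenvector have a closed form. The function names and the tolerance `tol` are assumptions made for this illustration; a general oracle would rely on a numerical eigenvalue routine instead.

```python
import math

def psd_separation_oracle(X, tol=1e-12):
    """Separation oracle for the constraints a^T X a >= 0 (all a in R^2).

    X is a symmetric 2x2 matrix [[p, q], [q, r]]. Returns None if X is
    positive semidefinite (up to tol); otherwise returns a unit vector a
    with a^T X a < 0, i.e., one violated inequality of the infinite system.
    """
    (p, q), (_, r) = X
    lam_min = (p + r) / 2 - math.hypot((p - r) / 2, q)   # smallest eigenvalue
    if lam_min >= -tol:
        return None
    if q != 0:
        a = (lam_min - r, q)              # eigenvector belonging to lam_min
    else:
        a = (1.0, 0.0) if p <= r else (0.0, 1.0)
    norm = math.hypot(a[0], a[1])
    return (a[0] / norm, a[1] / norm)

def quad_form(X, a):
    """Value of a^T X a."""
    return sum(a[i] * X[i][j] * a[j] for i in range(2) for j in range(2))

X_bad = [[1.0, 2.0], [2.0, 1.0]]          # eigenvalues 3 and -1
witness = psd_separation_oracle(X_bad)
```

For a unit eigenvector $a$ belonging to the smallest eigenvalue $\lambda_{\min}$ we have $a^T X a = \lambda_{\min}$, so a negative eigenvalue immediately yields a violated inequality.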

Numerous computational problems can be solved in polynomial 
time via semidefinite programming, some of them exactly and some 
approximately, and sometimes this yields the only known polynomial 
algorithm. A nice example of application of semidefinite programming 
is the maximum cut problem (MAXCUT), in which the vertex 
set of a given graph G = (V, E) should be divided into two parts so 
that the maximum possible number of edges go between the parts. 
Semidefinite programming is an essential component of an 
approximation algorithm for MAXCUT, called the Goemans–Williamson 
algorithm, that always computes a partition with the number of edges 
going between the parts at least 87.8% of the optimal number. This 
is the best known approximation guarantee, and most likely also the 
best possible one for any polynomial algorithm. More about this and 
related topics can be found, for instance, in the survey 


L. Lovász: Semidefinite programs and combinatorial 
optimization, in Recent Advances in Algorithms and 
Combinatorics (B. Reed and C. Linhares-Sales, editors), 
pages 137–194, Springer, New York, 2003. 


Let us remark that in the just outlined applications, the ellipsoid 
method can be replaced by certain interior point methods (the so- 
called volumetric-center methods, which are not mentioned in our brief 
discussion of interior point methods in Section 7.2 below), and this 
yields algorithms efficient both in theory and in practice. See 


K. Krishnan, T. Terlaky: Interior point and semidefinite 
approaches in combinatorial optimization, in: D. Avis, A. Hertz, 
and O. Marcotte (editors): Graph Theory and Combinatorial 
Optimization, Springer, Berlin etc. 2005, pages 101–158. 


Theory versus practice. The notion of polynomial algorithm was 
suggested in the 1970s by Jack Edmonds as a formalized counterpart 
of the intuitive notion of an efficient algorithm. Today, a theoretician’s 
first question for every algorithm is, Is it polynomial? 

How is it possible that the ellipsoid method, which is polynomial, 
is much slower in practice than the simplex method, which is not poly- 
nomial? One of the reasons is that even though the ellipsoid method 
is polynomial, the degree of the polynomial is quite high. The second 
and main reason is that the simplex method is slow only on artifi- 
cially constructed linear programs, which it almost never encounters 
in practice, while the ellipsoid method seldom behaves better than in 
the worst case. But the “good behavior on all inputs with rare ex- 
ceptions” of the simplex method seems hard to capture theoretically. 
Moreover, a guaranteed efficiency for all inputs is much more satis- 
factory than only an empirically supported belief that an algorithm is 
usually fast. 

The notion of polynomial algorithm thus has great shortcomings 
from a practical point of view. But attempts at constructing a 
polynomial algorithm in theory usually also lead, over time, to practically 
efficient algorithms. An impressive example in the area of linear 
programming is provided by interior point methods. 


7.2 Interior Point Methods 


The next time linear programming made it to press headlines was in 1984. 
Narendra Karmarkar, a researcher at AT&T Bell Laboratories, suggested an algorithm that is 
named after him and belongs to the large family of interior point methods. 
He proved its polynomiality and published results of computational experi- 
ments suggesting that in practice it is much faster than the simplex method. 
Although his statements in the latter direction turned out to be somewhat 
exaggerated, interior point methods are nowadays commonly used in linear 
programming and often they beat the simplex method, especially on very 
large linear programs. They are also applied with success to semidefinite 
programming and other important classes of optimization problems, such as 
convex quadratic programming. 

Interior point methods have been used for nonlinear optimization 
problems at least since the 1950s. For linear programs they were actually tested by 
the early 1970s, and interestingly, none was found competitive with the simplex 
method. This was because theory and hardware were not advanced enough: 
interior point methods typically outperform the simplex method only 
on problems so large that they were beyond the capabilities of the computers 
at that time, and moreover, efficient implementation of interior point 
methods relies on powerful routines for solving large but sparse systems of linear 
equations, which were not available either. The success story began only with 
Karmarkar’s results. 


The basic approach. When solving a linear program, the simplex method 
crawls along the boundary of the set of feasible solutions. The ellipsoid 
method encircles the set of feasible solutions, and up until the last step it 
remains outside of it. Interior point methods walk through the interior of the 
set of feasible solutions toward an optimum, carefully avoiding the boundary. 
Only at the very end, when they get almost to an optimum, they jump to an 
exact optimum by a rounding step. See the following schematic picture: 


Working with an interior point all the time is the key idea that gave the 
methods their name. Interior points possess various pleasant mathematical 
properties, a kind of “nondegeneracy,” and they allow one to avoid intricacies 
of the combinatorial structure of the boundary, which have been haunting 
research of the simplex method for decades. The art of interior point methods 
is how to keep away from the boundary while still progressing to the optimum. 
(For the sake of exposition, let us now assume that we are dealing with a linear 
program in which some initial interior point is available.) 

There are several rather different basic approaches in interior point meth- 
ods, and each has many variants. Interior point methods in linear program- 
ming are classified as central path methods (or central trajectory methods), 
potential reduction methods, and affine scaling methods, and for almost every 
approach one can consider a primal version, a dual version, a primal-dual 
version, or a self-dual version. 

Here we will consider only central path methods, which have been com- 
putationally the most successful. We will present a single algorithm from 
this family: one with the best known theoretical complexity bounds and very 
good practical performance. It is fair to say, though, that a number of differ- 
ent interior point methods yield the same theoretical bounds and that many 
practitioners might say that, from their point of view, there are even better 
algorithms. 

The analysis of the algorithm, as well as some important details of its 
implementation, are somewhat complicated and we will not discuss them. 
Modern linear programming software is based on quite advanced 
mathematics; for example, as was remarked above, one of the keys to the success of 
interior point methods is an efficient solver of special systems of linear 
equations, much more sophisticated than good old Gaussian elimination. 


In addition to the existence of innumerable variants of interior 
point methods in the literature, different expositions of the same al- 
gorithm may be based on different intuition. For example, what is 
presented as a step of Newton’s iteration method in one source may 
be derived by linear approximation in another source. We say this so 
that a reader who finds in the literature something seemingly rather 
different from what we say below is not confused more than necessary. 


The central path. First we explain the (mathematical) notion of central 
path. To this end, we consider an arbitrary convex polyhedron $P$ in $\mathbb{R}^n$ 
defined by a system $Ax \le b$ of $m$ linear inequalities, and a linear objective 
function $f(x) = c^T x$ as usual. We introduce a family of auxiliary objective 
functions $f_\mu$, depending on a parameter $\mu \in [0, \infty)$: 

$$f_\mu(x) = c^T x + \mu \cdot \sum_{i=1}^{m} \ln(b_i - a_i x),$$

where $a_i$ is the $i$th row of the matrix $A$ (regarded here as a row vector). Thus 
$f_0 = f$ is the original linear objective function, while the $f_\mu$ for $\mu > 0$ are 
nonlinear, due to the logarithms. The function $f_\mu$ is constructed in such a way 
that when $x$ approaches the boundary of $P$, i.e., the difference of the 
right-hand side and left-hand side of some inequality approaches 0, then $f_\mu$ tends 
to $-\infty$. The expression following $\mu$ in the definition of $f_\mu$ is called a barrier 
function, or more definitely, a logarithmic barrier. The word barrier is more 
fitting for a minimization problem, in our case minimizing $-f_\mu$, where the 
graph of the objective function has barriers preventing minimization 
algorithms from hitting the walls. 

We thus consider the auxiliary problem of maximizing $f_\mu(x)$ over $P$, for 
given $\mu > 0$. Since $f_\mu$ is undefined on the boundary of $P$, we actually 
maximize over $\mathrm{int}(P)$, the interior of $P$, and so for the problem to make sense, we 
need to assume that $\mathrm{int}(P) \ne \emptyset$. 

If we assume that, moreover, $P$ is bounded, then $f_\mu$ attains a maximum 
at a unique point in the interior of $P$, which we denote by $x^*(\mu)$. 


Indeed, the existence of a maximum follows from the well-known 
fact that every continuous function attains a maximum on a compact 
set: The appropriate compact set is $\{x \in \mathrm{int}(P) : f_\mu(x) \ge f_\mu(x_0)\}$, 
where $x_0 \in \mathrm{int}(P)$ is arbitrary (a little argument, which we leave to 
the reader, is needed to verify that this set is closed). 

As for uniqueness, let us assume that $f_\mu$ attains a maximum at 
two distinct feasible points $x$ and $y$. Then $f_\mu(x) = f_\mu(y)$, and since 
$f_\mu$ is easily seen to be concave (meaning that $-f_\mu$ is convex), it follows 
that $f_\mu$ has to be constant on the segment $xy$. Since the logarithm is 
strictly concave, this can happen only if $Ax = Ay$. But then $P$ would 
contain all of the line $\{x + s(y - x) : s \in \mathbb{R}\}$ and would not be 
bounded, a contradiction. In short, the maximum is unique because 
$f_\mu$ is strictly concave. 

The condition of $P$ bounded can be relaxed somewhat, but some 
condition forcing $f_\mu$ to be bounded above on $P$ is clearly necessary, 
as is documented by $P = \{x \in \mathbb{R}^1 : x_1 \ge 0\}$ and either $c = (1)$ or 
$c = (0)$. 


If $\mu$ is a very large number, the influence of the term $c^T x$ in $f_\mu$ is negligible 
and $x^*(\mu)$ is a point “farthest from the boundary,” called the analytic center 
of $P$. The following picture shows, for a two-dimensional $P$, contour lines³ of 
the function $f_\mu$ for $\mu = 100$ (the vector $c$ is depicted by an arrow and the 
point $x^*(\mu)$ by a dot): 

[Figure: contour lines of $f_\mu$ for $\mu = 100$.] 

On the other hand, for small $\mu$ the point $x^*(\mu)$ is close to an optimum of 
$c^T x$; see the illustrations below for $\mu = 0.5$ and $\mu = 0.1$: 

[Figure: contour lines of $f_\mu$ for $\mu = 0.5$ and for $\mu = 0.1$.] 

The central path is defined as the set $\{x^*(\mu) : \mu > 0\}$. We stress that the 
central path is not associated with $P$ itself, but rather with a particular 
system of inequalities defining $P$ and a particular linear objective function 
$c^T x$. 

³ A contour line of a function $f : \mathbb{R}^2 \to \mathbb{R}$ is a set of the form $f^{-1}(\{\alpha\})$, $\alpha \in \mathbb{R}$. 

The idea of central path methods is to start at $x^*(\mu)$ with a suitable 
large $\mu$, and then follow the central path, decreasing $\mu$ until an optimum of 
$c^T x$ is reached. Computing $x^*(\mu)$ exactly would be difficult, and so actual 
algorithms follow the central path only approximately. 
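
In one dimension the central path can be written down explicitly, which may help in building intuition. The sketch below uses a toy example chosen only for illustration: maximizing $x$ over $P = [0, 1]$, described by the inequalities $-x \le 0$ and $x \le 1$. Setting the derivative of $f_\mu(x) = x + \mu(\ln x + \ln(1 - x))$ to zero leads to the quadratic equation $x^2 + (2\mu - 1)x - \mu = 0$. The function names are hypothetical.

```python
import math

def f_mu(x, mu):
    """Barrier objective for maximizing x over P = [0, 1] (-x <= 0, x <= 1)."""
    return x + mu * (math.log(x) + math.log(1.0 - x))

def x_star(mu):
    """Maximizer of f_mu on (0, 1): the root of x^2 + (2 mu - 1) x - mu = 0 in (0, 1)."""
    return ((1.0 - 2.0 * mu) + math.sqrt(4.0 * mu * mu + 1.0)) / 2.0

# Points of the central path for decreasing values of mu.
path = {mu: x_star(mu) for mu in (100.0, 0.5, 0.1, 0.001)}
```

For large $\mu$ the point $x^*(\mu)$ is close to the analytic center $1/2$ of the interval, and as $\mu$ decreases it moves toward the optimum $x = 1$.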


A linear program in equational form and the primal-dual central 
path. Having introduced the concept of central path in a geometrically 
transparent way, we change the setting slightly. It turns out that for the 
actual algorithm it is better to replace general inequality constraints by 
equations and nonnegativity constraints; this is similar to the case of the simplex 
method. So now we consider the usual linear program in equational form 

$$\text{maximize } c^T x \text{ subject to } Ax = b,\ x \ge 0, \tag{7.2}$$

where $A$ is an $m \times n$ matrix of rank $m$. Here the barrier function should 
prevent violation of the nonnegativity constraints (while the equations only 
restrict everything to a subspace of $\mathbb{R}^n$ and they do not enter the objective 
function). So we set 

$$f_\mu(x) = c^T x + \mu \cdot \sum_{j=1}^{n} \ln x_j$$

and consider the auxiliary problem 

$$\text{maximize } f_\mu(x) \text{ subject to } Ax = b,\ x > 0,$$

where the notation $x > 0$ means that all coordinates of $x$ are strictly positive. 

We would again like to claim that under suitable conditions, the auxiliary 
problem has a unique maximizer $x^*(\mu)$ for every $\mu > 0$. Obviously, we need to 
assume that there is a feasible $x > 0$. Also, once we make sure that $f_\mu$ attains 
at least one maximum, it is easily seen, as above, that the maximum is unique. 

We now derive necessary conditions for the existence of a maximum, and 
at the same time, we express $x^*(\mu)$ as a solution of a system of equations. 
Later on, we will check that the necessary conditions are also sufficient. We 
derive these conditions by the method of Lagrange multipliers from 
analysis. 

We recall that this is a general method for maximization of $f(x)$ subject 
to $m$ constraints $g_1(x) = 0, g_2(x) = 0, \ldots, g_m(x) = 0$, where $f$ and $g_1, \ldots, g_m$ 
are functions from $\mathbb{R}^n$ to $\mathbb{R}$. It can be seen as a generalization of the basic 
calculus trick for maximizing a univariate function by seeking a zero of its 
derivative. It introduces the following system of equations with unknowns 
$x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$ (the $y_i$ are auxiliary variables called the Lagrange 
multipliers): 

$$g_1(x) = g_2(x) = \cdots = g_m(x) = 0 \quad\text{and}\quad \nabla f(x) = \sum_{i=1}^{m} y_i \nabla g_i(x). \tag{7.3}$$

Here $\nabla$ denotes the gradient (which by convention is a row vector): 

$$\nabla f(x) = \left( \frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}, \ldots, \frac{\partial f(x)}{\partial x_n} \right).$$

That is, $\nabla f$ is a vector function from $\mathbb{R}^n$ to $\mathbb{R}^n$ whose $i$th component is 
the partial derivative of $f$ with respect to $x_i$. Thus, the equation 
$\nabla f(x) = \sum_{i=1}^{m} y_i \nabla g_i(x)$ stipulates the equality of two $n$-component vectors. The 
method of Lagrange multipliers tells us that a maximum of $f(x)$ subject 
to $g_1(x) = g_2(x) = \cdots = g_m(x) = 0$ occurs at $x$ satisfying (7.3); that is, 
there exists $y$ such that the considered $x$ and this $y$ together fulfill (7.3) 
(a special case of this result is derived in Section 8.7). Of course, we need 
some conditions on $f$ and the $g_i$. It suffices to require that $f$ and the $g_i$ be 
defined on a nonempty open subset of $\mathbb{R}^n$ and have continuous first partial 
derivatives there, and this will be obviously satisfied in our simple 
application. 

We apply the method of Lagrange multipliers to maximizing $f_\mu(x)$ subject 
to $Ax = b$ (the nonnegativity constraints are taken care of implicitly, by the 
barriers). So we set $g_i(x) = b_i - a_i x$. Then, after a little manipulation, the 
system (7.3) becomes 

$$Ax = b, \qquad c + \mu \left( \frac{1}{x_1}, \frac{1}{x_2}, \ldots, \frac{1}{x_n} \right) = A^T y.$$

A more convenient form of this system is obtained by introducing an auxiliary 
nonnegative vector $s = \mu \left( \frac{1}{x_1}, \frac{1}{x_2}, \ldots, \frac{1}{x_n} \right) \in \mathbb{R}^n$. We rewrite the relation of 
$s$ and $x$ to $(s_1 x_1, s_2 x_2, \ldots, s_n x_n) = \mu \mathbf{1}$, with $\mathbf{1}$ denoting the vector of all 1’s. 

Then $x^*(\mu)$ is expressed as the $x$-part of a solution of the following system 
with unknowns $x, s \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$: 

$$\begin{aligned}
Ax &= b \\
A^T y - s &= c \\
(s_1 x_1, s_2 x_2, \ldots, s_n x_n) &= \mu \mathbf{1} \\
x, s &> 0.
\end{aligned} \tag{7.4}$$

All of these equations are linear except for $(s_1 x_1, s_2 x_2, \ldots, s_n x_n) = \mu \mathbf{1}$. 


Although we have derived the system (7.4) assuming $\mu > 0$, let 
us make a small digression and look at what (7.4) tells us for $\mu = 0$. 
Then, for nonnegative $x$ and $s$, the equation $(s_1 x_1, s_2 x_2, \ldots, s_n x_n) = 0$ 
is equivalent to $s^T x = 0$, and since $s = A^T y - c$, we have $0 = s^T x = 
y^T A x - c^T x = y^T b - c^T x$ (using $Ax = b$). This may remind one 
of the equality of objective functions for the primal and dual linear 
programs. And indeed, if we take (7.2) as a primal linear program, the 
dual is 

$$\text{minimize } b^T y \text{ subject to } A^T y \ge c,\ y \in \mathbb{R}^m. \tag{7.5}$$

Then (7.4) for $\mu = 0$ tells us exactly that $x$ is a feasible solution of 
the primal linear program (7.2), $y$ is a feasible solution of the dual 
linear program (7.5) (here the $s_j$ serve as slack variables expressing 
the difference $A^T y - c$), and the objective functions are equal! Hence 
such $x$ and $y$ are optimal. 
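
For a very small linear program the system (7.4) can be solved in closed form, which makes the role of $\mu$ visible. The sketch below is a toy example chosen for illustration, not taken from the text: maximize $x_1$ subject to $x_1 + x_2 = 1$, $x \ge 0$, so that $A = (1\ 1)$, $b = 1$, $c = (1, 0)$. The function name `central_path_point` is hypothetical.

```python
import math

def central_path_point(mu):
    """Solve (7.4) for: maximize x1 subject to x1 + x2 = 1, x >= 0.

    From A^T y - s = c we get s = (y - 1, y), and s_j x_j = mu gives
    x_j = mu / s_j. Substituting into x1 + x2 = 1 yields
    y^2 - (1 + 2 mu) y + mu = 0; the larger root has y > 1, so s > 0.
    """
    y = ((1.0 + 2.0 * mu) + math.sqrt(1.0 + 4.0 * mu * mu)) / 2.0
    s = (y - 1.0, y)
    x = (mu / s[0], mu / s[1])
    return x, y, s

def duality_gap(mu):
    """b^T y - c^T x at the central path point; it equals s^T x = n mu = 2 mu."""
    x, y, _ = central_path_point(mu)
    return y * 1.0 - x[0]

points = {mu: central_path_point(mu) for mu in (1.0, 0.1, 0.001)}
```

As $\mu$ tends to 0, the gap $b^T y - c^T x = 2\mu$ closes and $x$ approaches the optimal solution $(1, 0)$, in line with the digression above.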


So far we have shown that if the problem of maximizing $f_\mu(x)$ subject 
to $Ax = b$ and $x > 0$ has a maximum at $x^*$, then there exist $s^* > 0$ and 
$y^* \in \mathbb{R}^m$ such that $x^*, y^*, s^*$ satisfy (7.4). Next, we formulate conditions for 
the existence of the maximum (we prove only their sufficiency, but it can be 
shown that they are also necessary), and we show that under these conditions, 
the maximum is characterized by (7.4). 


7.2.1 Lemma. Let us suppose that the linear program (7.2) has a feasible 
solution $\tilde{x} > 0$ and that the dual linear program (7.5) has a feasible solution 
$\tilde{y}$ such that the slack vector $\tilde{s} = A^T \tilde{y} - c$ satisfies $\tilde{s} > 0$. (Less formally, both 
the primal and dual linear programs have an interior feasible point.) Then for 
every $\mu > 0$ the system (7.4) has a unique solution $x^* = x^*(\mu)$, $y^* = y^*(\mu)$, 
$s^* = s^*(\mu)$, and $x^*(\mu)$ is the unique maximizer of $f_\mu$ subject to $Ax = b$ and 
$x > 0$. 


Proof. Let $\mu > 0$ be fixed. We begin with the following claim. 


Claim. Under the assumptions of the lemma, the set 
$Q = \{x \in \mathbb{R}^n : Ax = b,\ x > 0,\ f_\mu(x) \ge f_\mu(\tilde{x})\}$ is bounded. 


Proof of the claim. We have 

$$\begin{aligned}
f_\mu(x) &= c^T x + \mu \sum_{j=1}^{n} \ln x_j \\
&= c^T x + \tilde{y}^T (b - Ax) + \mu \sum_{j=1}^{n} \ln x_j && \text{(since } Ax = b\text{)} \\
&= (c^T - \tilde{y}^T A) x + \tilde{y}^T b + \mu \sum_{j=1}^{n} \ln x_j \\
&= -\tilde{s}^T x + \tilde{y}^T b + \mu \sum_{j=1}^{n} \ln x_j && \text{(since } A^T \tilde{y} - \tilde{s} = c\text{)} \\
&= \tilde{y}^T b + \sum_{j=1}^{n} (\mu \ln x_j - \tilde{s}_j x_j).
\end{aligned}$$

The first term of the last line is a constant, and the rest is a 
sum of univariate functions. Each of these univariate functions 
is of the form $h_\alpha(x) = \mu \ln x - \alpha x$ with $\mu, \alpha > 0$. 
Elementary calculus shows that $h_\alpha(x)$ attains a unique maximum at 
$x = \mu/\alpha$, and in particular, it is bounded from above. Moreover, 
for every constant $C$, the set $\{x \in (0, \infty) : h_\alpha(x) \ge -C\}$ is 
bounded. 

Setting $K = f_\mu(\tilde{x}) - \tilde{y}^T b$, we have 

$$Q \subseteq \Big\{x > 0 : \sum_{j=1}^{n} h_{\tilde{s}_j}(x_j) \ge K\Big\} \subseteq \Big\{x > 0 : h_{\tilde{s}_j}(x_j) \ge K - \sum_{i \ne j} \max_{z > 0} h_{\tilde{s}_i}(z) \text{ for all } j\Big\}.$$

The last set is a Cartesian product of 
bounded intervals and the claim is proved. 
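
The elementary calculus used for $h_\alpha$ in the proof of the claim can be spelled out in a few lines; this is only a worked restatement of the facts quoted above, with the second derivative and the limits added for completeness.

```latex
h_\alpha'(x) = \frac{\mu}{x} - \alpha = 0 \iff x = \frac{\mu}{\alpha},
\qquad
h_\alpha''(x) = -\frac{\mu}{x^2} < 0 \quad (x > 0), \\[4pt]
\max_{x > 0} h_\alpha(x) = h_\alpha\!\left(\frac{\mu}{\alpha}\right)
  = \mu \ln\frac{\mu}{\alpha} - \mu, \\[4pt]
h_\alpha(x) \to -\infty \ \text{ as } x \to 0^+ \ \text{ and as } x \to \infty .
```

Because $h_\alpha$ tends to $-\infty$ as $x \to \infty$, each set $\{x \in (0, \infty) : h_\alpha(x) \ge -C\}$ is bounded, which is the property the Cartesian product argument relies on.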


So the set $Q$ is bounded, and it is not hard to check that it is closed. 
Hence the continuous function $f_\mu$ attains a maximum on it, which, as 
we know, is unique. This shows that $f_\mu$ attains a maximum under the 
assumptions of the lemma, and by means of Lagrange multipliers we 
have shown that this maximum yields a solution of (7.4). It remains 
to verify that this is the only solution of (7.4). What we do is to show 
that for every solution $\bar{x}, \bar{y}, \bar{s}$ of (7.4), $\bar{x}$ maximizes $f_\mu$ (we note that 
$\bar{s}$ and $\bar{y}$ are uniquely determined by $\bar{x}$ through the relations $\bar{s}_j \bar{x}_j = \mu$ 
and $A^T \bar{y} - \bar{s} = c$ from (7.4), using the assumption that $A$ has full 
rank). 

Let $\bar{x}, \bar{y}, \bar{s}$ be a solution of (7.4) and let $x$ satisfy $Ax = b$ and $x > 0$. 
Exactly as above we can express $f_\mu(x) = \bar{y}^T b + \sum_{j=1}^{n} (\mu \ln x_j - \bar{s}_j x_j)$, 
and the right-hand side is maximized by setting $x_j = \mu / \bar{s}_j$, that is, 
for $x = \bar{x}$. The lemma is proved. 


The set 

$$\left\{ (x^*(\mu), y^*(\mu), s^*(\mu)) \in \mathbb{R}^{2n+m} : \mu > 0 \right\}$$

is called the primal-dual central path of the linear program (7.2), and 
this is actually what the algorithm will follow (approximately). 


The algorithm. The algorithm for solving the linear program (7.2) maintains
current vectors $x, s \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$, with $x > 0$ and $s > 0$, that
satisfy all of the linear equations in (7.4); that is, $Ax = b$ and $A^T y - s = c$.
This makes sense, since it is the quadratic equations $s_j x_j = \mu$ that make the
problem complicated, and moreover, these are the only equations in (7.4) in
which $\mu$ enters. (We are still postponing the question of obtaining the initial
$x, y, s$.)

The current $x, y, s$ will in general fail to satisfy the conditions $s_j x_j = \mu$,
since we follow the primal-dual central path only approximately. We need to
quantify by how much they fail to satisfy them, and one suitable “centrality”
measure turns out to be

$$\mathrm{cdist}_\mu(x, s) = \bigl\| \bigl(\rho(s_1 x_1, \mu), \rho(s_2 x_2, \mu), \ldots, \rho(s_n x_n, \mu)\bigr) \bigr\|,$$

where $\|\cdot\|$ is the Euclidean norm of a vector and $\rho(a, \mu) = \sqrt{a/\mu} - \sqrt{\mu/a}$.
(This may look a little arbitrary, but the important thing is that it works.
Other variants of algorithms following the central path may use different
distance notions.)
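As a small illustration of the centrality measure, here is a sketch in Python. The function names `rho` and `cdist` and the sample vectors are chosen here for illustration only. The first call uses a point whose products $s_j x_j$ all equal $\mu$, the second a point that is far from the central path.

```python
import math

def rho(a, mu):
    """rho(a, mu) = sqrt(a/mu) - sqrt(mu/a); it vanishes exactly when a == mu."""
    return math.sqrt(a / mu) - math.sqrt(mu / a)

def cdist(x, s, mu):
    """Euclidean norm of (rho(s_1 x_1, mu), ..., rho(s_n x_n, mu))."""
    return math.sqrt(sum(rho(sj * xj, mu) ** 2 for xj, sj in zip(x, s)))

# All products s_j * x_j equal mu = 1, so the point is on the central path.
on_path = cdist([2.0, 0.5], [0.5, 2.0], mu=1.0)

# Products 4 and 1/4 give rho = 1.5 and -1.5: too far from the path for mu = 1.
off_path = cdist([2.0, 0.5], [2.0, 0.5], mu=1.0)
```

The second point violates the smallness condition used by the algorithm, since its distance exceeds $\sqrt{2}$.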

In a typical iteration of the algorithm, we have the current $\mu$ and $x, y, s$
such that $\mathrm{cdist}_\mu(x, s)$ is sufficiently small; concretely, one can take the smallness
condition to be $\mathrm{cdist}_\mu(x, s) < \sqrt{2}$. Then we decrease $\mu$ slightly; in the
considered algorithm we replace $\mu$ by $\bigl(1 - \frac{1}{2\sqrt{n}}\bigr)\mu$. For this new $\mu$ we again
want to approximate the solution $(x^*(\mu), y^*(\mu), s^*(\mu))$ to (7.4) sufficiently
closely. Since $\mu$ changed by only a little, we expect that the $x, y, s$ from the
previous iteration will be good initial guesses; in other words, in order to get
$x^*(\mu), y^*(\mu), s^*(\mu)$, we need to change $x, y, s$ by only a little. Let us denote
the required changes by $\Delta x$, $\Delta y$, and $\Delta s$, respectively. So we look for $\Delta x$,
$\Delta y$, $\Delta s$ such that (7.4) is satisfied by $x + \Delta x$, $y + \Delta y$, and $s + \Delta s$; that is,


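$$\begin{aligned}
A(x + \Delta x) &= b \\
A^T(y + \Delta y) - (s + \Delta s) &= c \\
\bigl((s_1 + \Delta s_1)(x_1 + \Delta x_1), \ldots, (s_n + \Delta s_n)(x_n + \Delta x_n)\bigr) &= \mu \mathbf{1} \\
x + \Delta x > 0, \quad s + \Delta s &> 0.
\end{aligned}$$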


Using the fact that $x, y, s$ satisfy the linear equations in (7.4) exactly, that
is, $Ax = b$ and $A^T y - s = c$, the first two lines simplify to $A \Delta x = 0$ and
$A^T \Delta y - \Delta s = 0$. So far things were exact, but now we make a heuristic step:
Since $\Delta x$ and $\Delta s$ are supposedly small compared to $x$ and $s$, we will neglect
the second-order products $\Delta x_j \Delta s_j$ in the equation system from the third
line. We will thus approximate the required changes in $x, s, y$ by a solution
to the following system:


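$$\begin{aligned}
A \Delta x &= 0 \\
A^T \Delta y - \Delta s &= 0 \\
(s_1 \Delta x_1 + x_1 \Delta s_1, \ldots, s_n \Delta x_n + x_n \Delta s_n) &= \mu \mathbf{1} - (s_1 x_1, s_2 x_2, \ldots, s_n x_n).
\end{aligned} \tag{7.6}$$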


The unknowns are $\Delta x$, $\Delta y$, $\Delta s$, while $x, y, s$ are regarded as constant. Hence
this is a system of linear equations, and it is the system whose fast solution
is a computational bottleneck of this interior point method. (For an actual
computation the system can still be simplified by algebraic manipulations,
but this is not our concern here.) In general we also need to worry about
the positivity of $x + \Delta x$ and $s + \Delta s$, but in the algorithm considered below,
luckily, it turns out that this is satisfied automatically; we will comment on
this later.


It can be shown that passing from $x, s, y$ to $x + \Delta x$, $y + \Delta y$,
$s + \Delta s$, with $\Delta x$, $\Delta y$, $\Delta s$ given by (7.6), can be regarded as a step of
the Newton iterative method for solving the system (7.4).
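To make the remark about algebraic simplification of (7.6) concrete, here is one standard reduction. It is not taken from the text above; $X$ and $S$ are shorthand introduced here for the diagonal matrices with the entries of $x$ and $s$ on the diagonal. The second equation of (7.6) gives $\Delta s$ in terms of $\Delta y$, the third then gives $\Delta x$, and the first leaves an $m \times m$ system for $\Delta y$ alone:

$$\begin{aligned}
\Delta s &= A^T \Delta y, \\
S\,\Delta x + X A^T \Delta y = \mu \mathbf{1} - X s
&\ \Longrightarrow\ \Delta x = \mu S^{-1}\mathbf{1} - x - S^{-1} X A^T \Delta y, \\
A\,\Delta x = 0
&\ \Longrightarrow\ \bigl(A S^{-1} X A^T\bigr)\,\Delta y = \mu A S^{-1}\mathbf{1} - b.
\end{aligned}$$

The last line uses $Ax = b$. Since $A$ has rank $m$ and $x, s > 0$, the matrix $A S^{-1} X A^T$ is symmetric and positive definite, so $\Delta y$ is determined uniquely, and $\Delta s$ and $\Delta x$ follow by substitution.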


We are ready to describe the algorithm. The input consists of a real $m \times n$
matrix $A$ of rank $m$, vectors $b \in \mathbb{R}^m$, $c \in \mathbb{R}^n$, and a real $\varepsilon > 0$, which is a
parameter controlling the accuracy of the returned solution. (If we want an
exact solution, the approximate solution found by the algorithm still has to
be “rounded” suitably; we will not discuss this step.)


1. Set $\mu = 1$ and initialize $x, y, s$ so that $Ax = b$, $A^T y - s = c$, $x > 0$,
   $s > 0$, and $\mathrm{cdist}_\mu(x, s) < \sqrt{2}$.

2. (Main loop) While $\mu \ge \varepsilon$, repeat Steps 3 and 4. As soon as $\mu < \varepsilon$,
   return $x$ as an approximately optimal solution and stop.

3. Replace $\mu$ with $\bigl(1 - \frac{1}{2\sqrt{n}}\bigr)\mu$.

4. (Newton step) Compute $\Delta x$, $\Delta y$, $\Delta s$ as the (unique) solution of the
   linear system (7.6). Replace $x$ by $x + \Delta x$, $y$ by $y + \Delta y$, and $s$ by
   $s + \Delta s$. Go to the next iteration of the main loop.


Step 1 is highly nontrivial—we know that finding a feasible solution of 
a general linear program is computationally as hard as finding an optimal 
solution—and we will discuss it soon. The rest of the algorithm has been 
specified completely, up to the way of solving (7.6) efficiently. 
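The following Python sketch puts Steps 2 to 4 together on a toy instance. Several things here are assumptions made for the illustration and not part of the text: the function names, the dense Gaussian elimination used to solve (7.6) (a real implementation would exploit the structure of the system), and the toy linear program, chosen because a point on its central path for $\mu = 1$ can be written down by hand. The `assert` inside the loop only checks the invariant numerically on this instance; it is not a proof.

```python
import math

def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for i in range(col + 1, n):
            f = aug[i][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[i][j] -= f * aug[col][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z

def newton_step(A, x, s, mu):
    """Solve (7.6); the unknowns are ordered as (dx, dy, ds)."""
    m, n = len(A), len(x)
    size = 2 * n + m
    K = [[0.0] * size for _ in range(size)]
    rhs = [0.0] * size
    for i in range(m):                       # A dx = 0
        for j in range(n):
            K[i][j] = A[i][j]
    for j in range(n):                       # A^T dy - ds = 0
        for i in range(m):
            K[m + j][n + i] = A[i][j]
        K[m + j][n + m + j] = -1.0
    for j in range(n):                       # s_j dx_j + x_j ds_j = mu - s_j x_j
        K[m + n + j][j] = s[j]
        K[m + n + j][n + m + j] = x[j]
        rhs[m + n + j] = mu - s[j] * x[j]
    z = solve(K, rhs)
    return z[:n], z[n:n + m], z[n + m:]

def cdist(x, s, mu):
    return math.sqrt(sum((math.sqrt(a * b / mu) - math.sqrt(mu / (a * b))) ** 2
                         for a, b in zip(x, s)))

def path_following(A, x, y, s, eps):
    """Steps 2-4; (x, y, s) must meet the requirements of Step 1 for mu = 1."""
    mu, n = 1.0, len(x)
    while mu >= eps:
        mu *= 1.0 - 1.0 / (2.0 * math.sqrt(n))        # Step 3
        dx, dy, ds = newton_step(A, x, s, mu)          # Step 4
        x = [a + d for a, d in zip(x, dx)]
        y = [a + d for a, d in zip(y, dy)]
        s = [a + d for a, d in zip(s, ds)]
        # Numerical check of the invariant on this instance.
        assert min(x) > 0 and min(s) > 0 and cdist(x, s, mu) < math.sqrt(2)
    return x, y, s

# Toy instance (an assumption): maximize -x1 - 2*x2 s.t. x1 + x2 = 2, x >= 0.
A = [[1.0, 1.0]]
y0 = -1.0 + math.sqrt(0.5)
s0 = [y0 + 1.0, y0 + 2.0]           # s = A^T y - c with c = (-1, -2)
x0 = [1.0 / s0[0], 1.0 / s0[1]]     # s_j x_j = 1, and x0 sums to 2
x, y, s = path_following(A, x0, [y0], s0, eps=1e-8)
```

On this instance the optimum is $x = (2, 0)$ with dual solution $y = -1$, and the returned vectors approach these values as $\varepsilon$ decreases.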


What needs to be done in the analysis. Here we prove neither 
correctness of the algorithm nor a bound on its running time. These 
things are not really difficult but they are somewhat technical. We 
just outline what has to be done. 

Let us note that the centrality measure appears only in the first
step of the algorithm (initialization). In the main loop we do not explicitly
check that the current $x$ and $s$ stay close to the central path; this
has to be established in the analysis, and a similar thing holds for the
conditions $x > 0$ and $s > 0$. In other words, one needs to show that
the following invariant holds for the current $x$ and $s$ in each iteration
of the main loop:


Invariant: $x > 0$, $s > 0$, and $\mathrm{cdist}_\mu(x, s) < \sqrt{2}$.


The next item is finiteness and convergence. It turns out that the
algorithm always finishes, and moreover, it needs at most $O(\sqrt{n} \log \frac{1}{\varepsilon})$
iterations. Last but not least, there is also the issue of rounding errors
and numerical stability. This concludes our sketch of the analysis.
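The iteration bound can at least be checked against the update rule of Step 3. This is only a consistency check, not the omitted analysis, which must also establish the invariant. Writing $\mu_k$ (notation introduced here) for the value of $\mu$ after $k$ passes through the main loop, starting from $\mu_0 = 1$ and assuming $\varepsilon < 1$, the inequality $1 - t \le e^{-t}$ gives

$$\mu_k = \Bigl(1 - \frac{1}{2\sqrt{n}}\Bigr)^{k} \le e^{-k/(2\sqrt{n})} < \varepsilon \quad \text{as soon as} \quad k > 2\sqrt{n}\,\ln\frac{1}{\varepsilon}.$$

For instance, with $n = 100$ and $\varepsilon = 10^{-6}$ we get $2 \cdot 10 \cdot \ln 10^{6} \approx 276.3$, so the main loop is executed at most 277 times.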


Variations. The realm of interior point methods is vast, and even the 
number of papers on central path methods runs into the thousands. 
We mention just several ideas on how the described algorithm can 
be varied, with the aim of better practical convergence, numerical 
stability, etc. 


• (Higher-order methods) We have said that the computation of
  $\Delta x$, $\Delta y$, $\Delta s$ is equivalent to a step of the Newton method. This
  method locally approximates the considered functions by linear
  functions, based on their first derivatives. One can also employ
  higher-order methods, where the approximation is done by multivariate
  polynomials, based on higher-order derivatives.

• (Truncated Newton steps) In the algorithm described above, luckily,
  it can be shown that making the full “Newton step,” i.e., going
  from $x, y, s$ to $x + \Delta x$, $y + \Delta y$, $s + \Delta s$, cannot leave the feasible
  region. For other algorithms this need not be guaranteed, and then
  one chooses a parameter $\alpha \in (0, 1]$ in each iteration and moves
  only to $x + \alpha \Delta x$, $y + \alpha \Delta y$, $s + \alpha \Delta s$, where $\alpha$ is determined so as
  to maintain feasibility, or also depending on other considerations.

• (Long-step methods) Decreasing $\mu$ by the factor $1 - \frac{1}{2\sqrt{n}}$ is a careful,
  “short-step” strategy, designed so that we do not move too
  far along the central path and a small change again brings $x, y, s$
  close enough to the new point of the central path. In practice it
  seems advantageous to make longer steps, i.e., to decrease $\mu$ more
  significantly. For example, some algorithms reduce $\mu$ by a fixed constant
  factor in each step. There are even adaptive algorithms (where the new $\mu$
  is not given by an explicit formula) that asymptotically achieve quadratic
  convergence; that is, for $\mu$ sufficiently small, a single step goes from
  $\mu$ to $\mathrm{const} \cdot \mu^2$.

  After such a large change of $\mu$, it is in general not sufficient to
  make a single Newton step. Rather, one iterates Newton steps
  with $\mu$ fixed until the current $x$ and $s$ get sufficiently close to the
  central path.

  The theoretical analysis becomes more difficult for long-step methods,
  and for some of the practically most successful such algorithms
  no reasonable theoretical bounds are known.
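The choice of the parameter $\alpha$ in a truncated Newton step can be sketched as follows. This is one common rule, not the rule of any particular algorithm from the text: take the full step if it stays strictly positive, and otherwise go a fixed fraction of the way to the boundary. The function name and the safety factor `shrink` are hypothetical.

```python
def max_feasible_step(x, dx, s, ds, shrink=0.9):
    """Step length alpha in (0, 1] keeping x + alpha*dx > 0 and s + alpha*ds > 0.

    Returns 1.0 if the full step stays strictly positive; otherwise returns
    the fraction `shrink` (< 1, a hypothetical safety factor) of the largest
    step that reaches the boundary of the positive orthant.
    """
    limit = float("inf")
    for v, dv in list(zip(x, dx)) + list(zip(s, ds)):
        if dv < 0:                       # only decreasing components can hit 0
            limit = min(limit, -v / dv)
    return 1.0 if limit > 1.0 else shrink * limit

# x_1 would reach 0 at alpha = 0.5 and s_2 at alpha = 0.25, so s_2 is binding.
alpha = max_feasible_step([1.0, 2.0], [-2.0, 1.0], [1.0, 1.0], [0.5, -4.0])
```

Here the step is cut to 0.9 times 0.25, and a direction with no decreasing component would be taken in full.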


Initialization. It remains to say how the first step of the algorithm, 
finding the initial x, y,s, can be realized. There are several approaches. 
Here we discuss one of them, an elegant method called a self-dual 
embedding. 


The idea is this: Given an input linear program, we set up another,
auxiliary linear program with the following properties:

(P1) The auxiliary linear program is always feasible and bounded and
     there is a simple, explicitly specified vector lying on its central
     path, from which the above path-following algorithm can be
     started.

(P2) From the optimal solution of the auxiliary linear program found by
     the algorithm we can read off an optimal solution of the original
     linear program or conclude that the original linear program is
     infeasible or unbounded.

We develop the auxiliary linear program in several steps. Here
things come out more nicely if we start with the original linear program
in an inequality form:

Maximize $c^T x$ subject to $Ax \le b$, $x \ge 0$.    (7.7)


As we have noted at the end of Section 6.1, the duality theorem implies
that (7.7) is feasible and bounded if and only if the following system
has a solution:

$$Ax \le b, \quad A^T y \ge c, \quad b^T y - c^T x \le 0, \quad x, y \ge 0$$

(the first inequality is the feasibility of $x$, the second inequality is the
feasibility of $y$ for the dual linear program, and the third inequality
forces equality of the primal and dual objective functions). Now comes
a small trick: We introduce a new scalar variable $\tau \ge 0$ (we will use
Greek letters to denote such “stand-alone” scalar variables) and we
multiply the right-hand sides by it. In this way, we obtain a homogeneous
system (called the Goldman-Tucker system):

$$\begin{aligned}
Ax - \tau b &\le 0 \\
-A^T y + \tau c &\le 0 \\
b^T y - c^T x &\le 0 \\
x, y \ge 0, \quad \tau &\ge 0.
\end{aligned} \tag{GTS}$$

This system, being homogeneous, always admits the zero solution,
but this is uninteresting. Interesting solutions, which allow us to solve
the original linear program, are those with $\tau > 0$ or $\rho > 0$, where
$\rho = \rho(x, y) = c^T x - b^T y$ denotes the slack in the last inequality. This
is because of the following lemma:


7.2.2 Lemma. No solution of (GTS) has both $\tau$ and $\rho$ nonzero, and
exactly one of the following possibilities always occurs:


(i) There is a solution $(x, y, \tau)$ with $\tau > 0$ (and $\rho = 0$), in which case
$\frac{1}{\tau} x$ is an optimal solution of the primal linear program (7.7) and
$\frac{1}{\tau} y$ is an optimal solution of the dual.

(ii) There is a solution $(x, y, \tau)$ with $\rho > 0$ (and $\tau = 0$), in which case
the primal linear program (7.7) is infeasible or unbounded.


Proof (sketch). By weak duality it is immediate that any solution
with $\tau > 0$ has $\rho = 0$ and yields a pair of optimal solutions. Conversely,
by the (strong) duality theorem, any pair $(x^*, y^*)$ of optimal solutions
provides a solution of (GTS) with $\tau = 1$. Hence $\rho > 0$ implies $\tau = 0$
and infeasibility or unboundedness of the primal linear program. It
remains to check that infeasibility of the primal linear program or of
the dual linear program gives a solution of (GTS) with $\rho > 0$.

Let us assume, for example, that the primal linear program is infeasible
(the dual case is analogous); that is, $Ax \le b$ has no
nonnegative solution. Then by the Farkas lemma (Proposition 6.4.1)
there is a $\bar{y} \ge 0$ with $A^T \bar{y} \ge 0$ and $b^T \bar{y} < 0$, and then $x = 0$, $y = \bar{y}$,
$\tau = 0$ is a solution to (GTS) with $\rho = -b^T \bar{y} > 0$.


As a side remark, we note that applying the Farkas lemma to the
Goldman-Tucker system yields an alternative derivation of the duality
theorem from the Farkas lemma.


To boost morale, we note that we have achieved something like
(P2): By means of solutions to (GTS) of a special kind we can solve
the original linear program, as well as deal with its infeasibility or
unboundedness. But how do we compute a solution with $\rho > 0$ or
$\tau > 0$, avoiding the trivial zero solution?

Luckily, interior point methods are very suitable for this, since they
converge to a “most generic” optimal solution of the given linear program.
Roughly speaking, the interior point algorithm described above
converges to the analytic center of the set of all optimal solutions,
and this particular optimal solution does not satisfy any constraint
(inequality or nonnegativity constraint) with equality if this can be
avoided at all. In particular, if we made (GTS) into a linear program
by adding a suitable objective function, a “most generic” optimal solution
would have $\tau > 0$ or $\rho > 0$.

There is still an obstacle to be overcome: We do not have an explicit
feasible interior point for (GTS) to start the algorithm, and what
is worse, no feasible interior point (one with all variables and slacks
strictly positive) exists! This is because we always have $\tau = 0$ or $\rho = 0$,
as was noted in the lemma.

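The next step is to enlarge the system so that the vector of all 1's
is “forced” as a feasible interior point. Before proceeding we simplify
the notation: We write the system (GTS) in matrix form as $M_0 u \le 0$,
$u \ge 0$, where

$$M_0 = \begin{pmatrix} 0 & A & -b \\ -A^T & 0 & c \\ b^T & -c^T & 0 \end{pmatrix}, \qquad u = \begin{pmatrix} y \\ x \\ \tau \end{pmatrix}.$$

We note that $M_0^T = -M_0$; that is, $M_0$ is a skew-symmetric matrix. If
the original matrix $A$ has size $m \times n$, then $M_0$ is a $k \times k$ matrix with
$k = n + m + 1$.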

Now we set up the appropriate linear program with a feasible interior
point, and then we gradually explain how and why it works. We
define a vector $r \in \mathbb{R}^k$ by $r = \mathbf{1} + M_0 \mathbf{1}$, a $(k+1) \times (k+1)$ matrix $M$ by

$$M = \begin{pmatrix} M_0 & -r \\ r^T & 0 \end{pmatrix}$$

(we note that $M$ is skew-symmetric too, which will be useful), and a
vector $q \in \mathbb{R}^{k+1}$ by $q = (0, 0, \ldots, 0, k+1)$. We consider the following
linear program with variable vector $v = (u, \vartheta)$ (that is, $u$ with a new
variable $\vartheta$ appended to the end):


Maximize $-q^T v$ subject to $Mv \le q$, $v \ge 0$.    (SD)


First we explain the abbreviation (SD). It stands for self-dual, and
indeed, if we form the dual linear program to (SD), we obtain


minimize $q^T w$ subject to $M^T w \ge -q$, $w \ge 0$.


Using $M^T = -M$ and changing signs, which changes minimization
to maximization and flips the direction of the inequality, we arrive
exactly at (SD). So (SD) is equivalent to its own dual linear program.

For a feasible solution $v$ of (SD) we define the slacks by $z = z(v) = q - Mv$
(“z for zlacks”). Here is a key notion: A feasible solution $v$ of
(SD) is called strictly complementary if for every $j = 1, 2, \ldots, k+1$
we have $v_j > 0$ or $z_j > 0$. The next lemma shows that a strictly
complementary optimal solution is exactly what we need for solving
the original linear program (7.7) (more precisely, we don’t need the
full force of strict complementarity, only a strict complementarity for
a particular $j$).


7.2.3 Lemma. The linear program (SD) is feasible and bounded, every
optimal solution has $\vartheta = 0$, and hence its $u$-part is a solution of
the Goldman-Tucker system (GTS). Moreover, every strictly complementary
optimal solution yields a solution of (GTS) with $\tau > 0$ or
$\rho > 0$.


Proof. We have $0$ as a feasible solution of (SD), and also of its dual.
Thus (SD) is feasible and bounded. For $v = w = 0$ both the primal
and dual objective functions have value 0, so 0 must be their common
optimal value. It follows that every optimal solution has $\vartheta = 0$. The
rest of the lemma is easily checked.


Hence for solving the original linear program (7.7) it suffices to
find a strictly complementary optimal solution of (SD). It turns out
that this is exactly what the algorithm described above computes.

We will not prove this in full, but let us see what is going on. In
order to apply the algorithm to (SD), we first need to convert (SD) to
equational form by adding the slack variables:


Maximize $-q^T v$ subject to $Mv + z = q$, $v, z \ge 0$.


Now we want to write down the system (7.4) specifying points on
the central path for the considered linear program. So we substitute
$A = (M \mid I_{k+1})$ (so $A$ is a $(k+1) \times 2(k+1)$ matrix), $b = q$, $c = (-q, 0)$,
and $x = (v, z)$. We also need to introduce the extra variables $y$ and
$s$ appearing in (7.4): For reasons that will become apparent later, we
write $v'$ for $y$ and $(z', z'')$ for $s$, where $z'$ and $z''$ are $(k+1)$-component
vectors. So we have $v, z, z', z''$ nonnegative, while $v'$ is (so far) arbitrary.

The equation $Ax = b$ from (7.4) becomes $Mv + z = q$. From
$A^T y - s = c$ we get two equations: $M^T v' - z' = -q$ and $v' - z'' = 0$.
The first of these two can be rewritten as $Mv' + z' = q$. The second
just means that $v' = z'' \ge 0$, and so $z''$ can be disregarded if we add
the constraint $v' \ge 0$. Finally, $s_j x_j = \mu$ in (7.4) yields $v_j z'_j = \mu$ and
$v'_j z_j = \mu$ for all $j = 1, 2, \ldots, k+1$. The full system is thus


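$$\begin{aligned}
Mv + z &= q \\
Mv' + z' &= q \\
v_j z'_j &= \mu \quad \text{for all } j = 1, 2, \ldots, k+1 \\
v'_j z_j &= \mu \quad \text{for all } j = 1, 2, \ldots, k+1 \\
v, z, v', z' &\ge 0.
\end{aligned}$$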


It is easily verified, using the skew-symmetry of $M_0$, that the system
$Mv + z = q$ is satisfied by $v = z = \mathbf{1}$ ($M$ and $q$ were set up
that way). Thus $v = z = v' = z' = \mathbf{1}$ is a solution of the just-derived
system with $\mu = 1$, and we can use it as an initial point on the central
path in Step 1 of the algorithm. To complete the analysis one needs
to show that the algorithm converges to a strictly complementary optimal
solution, and as we said above, this part is omitted.
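The claim that $v = z = \mathbf{1}$ satisfies $Mv + z = q$ is easy to check numerically. The sketch below builds $M_0$, $r$, $M$, and $q$ for a small instance of (7.7). Two assumptions are made here, since they are reconstructions rather than quotations: the rows of $M_0$ follow the order of the three inequalities of (GTS) with variable order $u = (y, x, \tau)$, and $M$ has the block form with $-r$ as its last column and $r^T$ as its last row. The function names and the sample data are chosen for illustration only.

```python
def build_self_dual(A, b, c):
    """Return (M0, r, M, q) for: maximize c^T x subject to A x <= b, x >= 0."""
    m, n = len(A), len(c)
    k = m + n + 1
    M0 = [[0.0] * k for _ in range(k)]
    for i in range(m):
        for j in range(n):
            M0[i][m + j] = A[i][j]        # rows for  A x - tau b <= 0
            M0[m + j][i] = -A[i][j]       # rows for -A^T y + tau c <= 0
        M0[i][k - 1] = -b[i]
        M0[k - 1][i] = b[i]               # last row: b^T y - c^T x <= 0
    for j in range(n):
        M0[m + j][k - 1] = c[j]
        M0[k - 1][m + j] = -c[j]
    r = [1.0 + sum(row) for row in M0]    # r = 1 + M0 1
    M = [M0[i] + [-r[i]] for i in range(k)] + [r + [0.0]]
    q = [0.0] * k + [float(k + 1)]
    return M0, r, M, q

def is_skew_symmetric(M):
    size = len(M)
    return all(M[i][j] == -M[j][i] for i in range(size) for j in range(size))

# Sample data (an assumption): m = n = 2, hence k = 5 and M is 6 x 6.
A = [[1.0, 2.0], [3.0, 1.0]]
b = [4.0, 5.0]
c = [1.0, 1.0]
M0, r, M, q = build_self_dual(A, b, c)

# Slack vector z = q - M v for v = 1; the construction forces z = 1 as well.
z = [qi - sum(row) for row, qi in zip(M, q)]
```

The last component of $z$ equals 1 because the entries of a skew-symmetric matrix sum to zero, so the entries of $r$ sum to $k$.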

As a final remark to the algorithm we note that the system above
can be simplified. We know that (7.4) in general has a unique solution
(provided that the primal and dual linear programs both have a feasible
interior point, which is satisfied in our case). At the same time, we
observe that if $v, z, v', z'$ is a solution, then interchanging $v$ with $v'$
and $z$ with $z'$ also yields a solution, and so uniqueness implies $v = v'$
and $z = z'$. Therefore, it is sufficient to work with the simpler system


$Mv + z = q$, $\quad v_j z_j = \mu$ for $j = 1, 2, \ldots, k+1$, $\quad z, v \ge 0$


and to set up the corresponding linear system for the changes $\Delta v$ and
$\Delta z$ accordingly. This concludes the description and partial analysis of
the considered interior point algorithm.


Let us remark that the self-dual embedding trick is quite universal: 
Almost any interior point algorithm can be used for computing an 
optimum of the self-dual linear program constructed as above, and 
this yields an optimal solution of the original linear program. 


Computational complexity of interior point methods. Several interior
point methods for linear programming are known to be (weakly) polynomial
algorithms, including the one given above (with the self-dual embedding). The
total number of bit operations for the best of these algorithms is bounded by
$O(n^3 L)$, where $L$ is the maximum bit size of coefficients in the linear program
and $n$ is the number of variables. The maximum number of iterations before
reaching an optimum is $O(\sqrt{n} L)$. On the other hand, examples are known
(based on the Klee-Minty cube, again!) for which any algorithm following
the central path must make $\Omega(\sqrt{n} \log n)$ iterations; see


A. Deza, E. Nematollahi, and T. Terlaky: How good are interior point
methods? Klee-Minty cubes tighten iteration-complexity bounds,
Technical Report, McMaster University, 2004.


In practice the number of iterations seems to be bounded by a constant or 
by O(log n) in almost all cases. This is somewhat similar to the situation for 
the simplex method, where the worst-case behavior is much worse than the 
behavior for typical inputs. 


Our presentation of interior point methods is inspired mostly by 


T. Terlaky: An easy way to teach interior-point methods, European
Journal of Operational Research 130, 1 (2001), 1-19,


and full proofs of the results in this section can be found in 


C. Roos, T. Terlaky, and J.-P. Vial: Interior Point Methods for 
Linear Optimization (2nd edition), Springer, Berlin etc., 2005. 


A compact survey is 


F. A. Potra, S. J. Wright: Interior-point methods, Journal of
Computational and Applied Mathematics 124 (2000), pages
281-302;


at time of writing this text it was accessible at websites of the authors. 
Several more books from the late 1990s are cited in these papers, and 
an immense amount of material can be found online. 


8. More Applications 


Here we have collected several applications of linear programming, and in 
particular, of the duality theorem. They are slightly more advanced than 
those in Chapters 2 and 3, but we have tried to keep everything very con- 
crete and as elementary as possible, and we hope that even a mathematically 
inexperienced reader will have no problems enjoying these small gems. 


8.1 Zero-Sum Games 


The Colonel Blotto game. Colonel Blotto and his opponent are preparing 
for a battle over three mountain passes. Each of them commands five regi- 
ments. The one who sends more regiments to a pass occupies it, but when the 
same number of regiments meet, there will be a draw. Finally, the one who 
occupies more passes than the other wins the battle, with a draw occurring 
if both occupy the same number of passes. 

Given that all three passes have very similar characteristics, the strate- 
gies independently pursued by both Colonel Blotto and his opponent are 
the following: First they partition their five regiments into three groups. For 
example, the partition (0, 1,4) means that one pass will be attacked by 4 reg- 
iments, another pass by 1 regiment, and one pass will not be attacked at all. 
Then, the groups are assigned to the passes randomly; that is, each of the 
3! = 6 possible assignments of groups to passes is equally likely. 

The partitions of Colonel Blotto and his opponent determine winning 
probabilities for both of them (in general, these do not add up to one because 
of possible draws). Both Colonel Blotto and his opponent want to bias the 
difference of these probabilities in their direction as much as possible. How 
should they choose their partitions? 

This is an instance of a finite two-player zero-sum game. In such a 
game, each of the two players has a finite set of possible strategies (in our 
case, the partitions), and each pair of opposing strategies leads to a payoff 
known to both players. In our case, we define the payoff as Colonel Blotto’s 
winning probability minus the opponent’s winning probability. Whatever one 
of the players wins, the other player loses, and this explains the term zero-sum
game. To some extent, it has become a part of common vocabulary.

When we number the strategies $1, 2, \ldots, m$ for the first player and
$1, 2, \ldots, n$ for the second player, the payoffs can be recorded in the form
of an $m \times n$ payoff matrix. In the Colonel Blotto game, the payoff matrix
looks as follows, with the rows corresponding to the strategies of Colonel
Blotto and the columns to the strategies of the opponent.


           (0,0,5)  (0,1,4)  (0,2,3)  (1,1,3)  (1,2,2)
(0,0,5)       0      -1/3     -1/3      -1       -1
(0,1,4)      1/3       0        0      -1/3     -2/3
(0,2,3)      1/3       0        0        0       1/3
(1,1,3)       1       1/3       0        0      -1/3
(1,2,2)       1       2/3     -1/3      1/3       0

For example, when Colonel Blotto chooses (0,1,4) and his opponent
chooses (0,0,5), then Colonel Blotto wins (actually, without fighting) if and
only if his two nonempty groups arrive at the two passes left unattended by
his opponent. The probability for this to happen is 1/3. With probability 2/3
there will be a draw, so the difference of the winning probabilities is 1/3 - 0 = 1/3.

Not knowing what the opponent is going to do, Colonel Blotto might want 
to choose a strategy that guarantees the highest payoff in the worst case. The 
only candidate for such a strategy is (0, 2,3): No matter what the opponent 
does, Colonel Blotto will get a payoff of at least 0 with this strategy, while all 
other strategies lead to negative payoff in the worst case. (Anticipating that 
a spy of the opponent might find out about his plans, he must reckon that 
the worst case will actually happen. The whole game is not a particularly 
cheerful matter anyway.) In terms of the payoff matrix, Colonel Blotto looks 
at the minimum in each row, and he chooses a row where this minimum is 
the largest possible. 

Similarly, the opponent wants to choose a strategy that guarantees the 
lowest payoff (for Colonel Blotto) in the worst case. It turns out that (0, 2,3) 
is also the unique such choice for the opponent, because it guarantees that 
Colonel Blotto will receive payoff at most 0, while all other strategies allow 
him to achieve a positive payoff if he happens to guess or spy out the op- 
ponent’s strategy. In terms of the payoff matrix, the opponent looks at the 
maximum in each column, and he chooses a column where this maximum is 
the smallest possible. 

We note that if both Colonel Blotto and his opponent play the strategies 
selected as above, they both see their worst expectations come true, exactly 
those on which they pessimistically based their choice of strategy. Seeing the 
worst case happen might shatter hopes for a better outcome of the battle, 
but on the other hand, it is a relief. After the battle has been fought, neither 
Colonel Blotto nor his opponent will have to regret their choice: Even if both
had known the other’s strategy in advance, neither of them would have had
an incentive to change his own strategy.

This is an interesting feature of this game: The strategy selected by
Colonel Blotto and the strategy selected by his opponent as above are best
responses against one another. In terms of the payoff matrix, the entry 0 in
row (0,2,3) and column (0,2,3) is a “saddle point”; it is a minimum
in its row and a maximum in its column. A pair of strategies that are best
responses against one another is called a Nash equilibrium of the game.
As we will see next, not every game has a Nash equilibrium in this sense.
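The payoff matrix of the Colonel Blotto game and its saddle point can be recomputed by brute force. The sketch below is an illustration with names chosen here. It uses the observation that, since both players assign their groups to passes uniformly at random, it suffices to fix the opponent's assignment and average over the six assignments of Colonel Blotto.

```python
from fractions import Fraction
from itertools import permutations

PARTITIONS = [(0, 0, 5), (0, 1, 4), (0, 2, 3), (1, 1, 3), (1, 2, 2)]

def payoff(a, b):
    """Blotto's winning probability minus the opponent's, partition a vs. b."""
    total = 0
    perms = list(permutations(a))        # 6 equally likely assignments
    for p in perms:                      # the opponent's assignment b is fixed
        won = sum(pi > bi for pi, bi in zip(p, b))
        lost = sum(pi < bi for pi, bi in zip(p, b))
        total += (won > lost) - (won < lost)   # +1 win, -1 loss, 0 draw
    return Fraction(total, len(perms))

M = [[payoff(a, b) for b in PARTITIONS] for a in PARTITIONS]

# Blotto maximizes the row minimum, the opponent minimizes the column maximum.
row_min = [min(row) for row in M]
col_max = [max(M[i][j] for i in range(5)) for j in range(5)]
maximin_row = max(range(5), key=lambda i: row_min[i])
minimax_col = min(range(5), key=lambda j: col_max[j])
```

Both indices point to the partition (0,2,3), and the common value of the row minimum and the column maximum there is 0, in agreement with the saddle point described above.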


The Rock-Paper-Scissors game. Alice and Bob independently choose a 
hand gesture indicating either a rock, a piece of paper, or a pair of scissors. 
If both players choose the same gesture, the game is a draw, and otherwise, 
there is a cyclic pattern: Scissors beats paper (by cutting it), paper beats 
rock (by wrapping it up), rock beats scissors (by making it blunt). Assuming 
that the payoff to Alice is 1 if she wins, 0 if there is a draw, and -1 if she
loses, the payoff matrix is


            rock   paper   scissors
rock          0     -1        1
paper         1      0       -1
scissors     -1      1        0


This game has no Nash equilibrium in the sense explained above. No entry of 
the payoff matrix is a minimum in its row and a maximum in its column at 
the same time. In more human terms, after every game, the player who lost 
may regret not to have played the gesture that would have beaten the gesture 
of the winner (and both may regret in the case of a draw). It is impossible for 
both players to fix strategies that are best responses each against the other. 

But when we generalize the notion of a strategy, there is a way for both 
players to avoid regret. Both should decide randomly, selecting each of the 
gestures with probability 1/3. Even this strategy may lose, of course, but still 
there is no reason for regret, since with the same probability 1/3, it could 
have won, and the fact that it didn’t is not a fault of the strategy but just 
bad luck. Indeed, in this way both Alice and Bob can guarantee that their 
payoff is 0 in expectation, and it is easy to see that neither of them can do 
better by unilaterally switching to a different behavior. We say that we have
a mixed Nash equilibrium of the game (formally defined below).
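The guarantee of expected payoff 0 can be verified directly from the payoff matrix. Write $M$ for the matrix above and let $y = (y_1, y_2, y_3)$ be an arbitrary vector of probabilities with which Bob plays rock, paper, and scissors (notation introduced here for this check). If Alice plays each gesture with probability 1/3, her expected payoff is obtained by weighting each entry with the probability of its row and column:

$$\tfrac{1}{3}\,(1,1,1)\,M\,y = \tfrac{1}{3}\bigl((0 + 1 - 1)\,y_1 + (-1 + 0 + 1)\,y_2 + (1 - 1 + 0)\,y_3\bigr) = 0.$$

Every column of $M$ sums to 0, so the result does not depend on $y$. Every row sums to 0 as well, so by the same computation Bob's uniform play holds Alice to expected payoff 0 whatever she does.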

A surprising fact is that every zero-sum game has a mixed Nash equilib- 
rium. It turns out that such an equilibrium “solves” the game in the sense 
that it tells us (or rather, both of the two players) how to play the game 
optimally. As we will see in examples, the random decisions that are involved 
in a mixed Nash equilibrium need not give each of the possible strategies 
the same probability, as was the case in the very simple Rock-Paper-Scissors 
game. However, we will prove that suitable probability distributions always 
exist and, moreover, that they can be computed using linear programming. 




Existence and computation of a mixed Nash equilibrium. Let us 
repeat the setup of zero-sum games in a more formal manner. We have two 
players, and we stick to calling them Alice and Bob. Alice has a set of m 
pure strategies at her disposal, while Bob has a set of $n$ pure strategies (we
assume that $m, n \ge 1$).

Then there is an $m \times n$ payoff matrix $M$ of real numbers such that $m_{ij}$ is
Alice’s gain (and Bob’s loss) when Alice’s $i$th pure strategy is played against
Bob’s $j$th pure strategy. For concreteness, we may think of Bob having to
pay €$m_{ij}$ to Alice. Of course, the situation is symmetric in that $m_{ij}$ might
be negative, in which case Alice has to pay €$-m_{ij}$ to Bob.

A mixed strategy of a player is a probability distribution over his or
her set of pure strategies. We encode a mixed strategy of Alice by an
$m$-dimensional vector of probabilities

$$x = (x_1, \ldots, x_m), \quad \sum_{i=1}^m x_i = 1, \quad x \ge 0,$$

and a mixed strategy of Bob by an $n$-dimensional vector of probabilities

$$y = (y_1, \ldots, y_n), \quad \sum_{j=1}^n y_j = 1, \quad y \ge 0.$$


So a mixed strategy is not a particular case of a pure strategy; in the
Rock-Paper-Scissors game, Alice has three possible pure strategies (rock, paper,
and scissors), but infinitely many possible mixed strategies: She can choose
any three nonnegative real numbers $x_1, x_2, x_3$ with $x_1 + x_2 + x_3 = 1$, and
play rock with probability $x_1$, paper with probability $x_2$, and scissors with
probability $x_3$. Each such triple $(x_1, x_2, x_3)$ specifies a mixed strategy.

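Given mixed strategies $x$ and $y$ of Alice and Bob, the expected payoff
(expected gain of Alice) when $x$ is played against $y$ is

$$\begin{aligned}
\sum_{i,j} m_{ij} \, \mathrm{Prob}_{x,y}[\text{Alice plays } i, \text{ Bob plays } j]
&= \sum_{i,j} m_{ij} \, \mathrm{Prob}_x[\text{Alice plays } i] \cdot \mathrm{Prob}_y[\text{Bob plays } j] \\
&= \sum_{i,j} m_{ij} x_i y_j \\
&= x^T M y.
\end{aligned}$$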


Now we are going to formalize the tenet of Colonel Blotto: “Prepare for
the worst.” When Alice considers playing some mixed strategy $x$, she expects
Bob to play a best response against $x$: a strategy $y$ that minimizes her
expected payoff $x^T M y$. Similarly, for given $y$, Bob expects Alice to play a
strategy $x$ that maximizes $x^T M y$.

For a fixed matrix $M$, these worst-case payoffs are captured by the following
two functions:

$$\beta(x) = \min_y x^T M y, \qquad \alpha(y) = \max_x x^T M y.$$

So $\beta(x)$ is the best (smallest) expected payoff that Bob can achieve against
Alice’s mixed strategy $x$, and similarly, $\alpha(y)$ is the best (largest) expected
payoff that Alice can achieve against Bob’s $y$. It may also be worth noting
that $y_0$ is Bob’s best response against some $x$ exactly if $x^T M y_0 = \beta(x)$ (the
symmetric statement for Alice is left to the reader).

Let us note that $\beta$ and $\alpha$ are well-defined functions, since we are optimizing
over compact sets. For $\alpha$, say, the set of all $x$ representing probability
distributions is an $(m-1)$-dimensional simplex in $\mathbb{R}^m$, and hence indeed compact.
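The two functions are easy to evaluate in code. The sketch below relies on a fact not spelled out above: for fixed $x$, the expression $x^T M y$ is linear in $y$, so its minimum over the simplex of mixed strategies is attained at a vertex, that is, at a pure strategy of Bob (and symmetrically for Alice). The function names are chosen here for illustration, and the Rock-Paper-Scissors matrix serves as sample data.

```python
from fractions import Fraction

def beta(M, x):
    """Worst case for Alice playing x: min over Bob's pure strategies j."""
    rows, cols = len(M), len(M[0])
    return min(sum(x[i] * M[i][j] for i in range(rows)) for j in range(cols))

def alpha(M, y):
    """Worst case for Bob playing y: max over Alice's pure strategies i."""
    rows, cols = len(M), len(M[0])
    return max(sum(M[i][j] * y[j] for j in range(cols)) for i in range(rows))

# Rock-Paper-Scissors payoff matrix (rows: Alice, columns: Bob).
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]

uniform = [Fraction(1, 3)] * 3
always_rock = [1, 0, 0]

beta_uniform = beta(RPS, uniform)        # Alice guarantees 0 by uniform play
alpha_uniform = alpha(RPS, uniform)      # Bob concedes at most 0 by uniform play
beta_rock = beta(RPS, always_rock)       # Bob answers rock with paper
alpha_rock = alpha(RPS, always_rock)     # Alice answers rock with paper
```

For uniform play both values are 0, while the pure strategy rock gives $\beta = -1$ for Alice and $\alpha = 1$ for Bob.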


8.1.1 Definition. A pair $(\tilde{x}, \tilde{y})$ of mixed strategies is a mixed Nash
equilibrium of the game if $\tilde{x}$ is a best response against $\tilde{y}$ and $\tilde{y}$ is a best response
against $\tilde{x}$ (the adjective “mixed” is often omitted); in formulas, this can be
expressed as

$$\beta(\tilde{x}) = \tilde{x}^T M \tilde{y} = \alpha(\tilde{y}).$$


In the Colonel Blotto game, we have even found a (pure) Nash 
equilibrium (a pair of pure strategies that are best responses against 
each other). However, the strategies themselves involved random de- 
cisions. We regard these decisions as “hard-wired” into the strategies 
and the payoff matrix. 

Alternatively, we can consider each fixed assignment of regiments 
to passes as a pure strategy. Then we have a considerably larger payoff 
matrix, and there is no pure Nash equilibrium. Rather, we have a 
mixed Nash equilibrium. In practical terms it amounts to the same 
thing as the strategies described in the previous interpretation of the 
game, namely, dividing the regiments in groups of 3, 2, and 0, and 
sending one group to each pass at random. 

These are two different views (models) of the same game, and we 
are free to investigate either one, although one may be more convenient 
or more realistic than the other. 


Let us say that Alice’s mixed strategy $\tilde{x}$ is worst-case optimal if $\beta(\tilde{x}) =
\max_x \beta(x)$. That is, Alice expects Bob to play his best response against every
mixed strategy of hers, and she chooses a mixed strategy $\tilde{x}$ that maximizes her
expected payoff under this (pessimistic) assumption. Similarly, Bob’s mixed
strategy $\tilde{y}$ is worst-case optimal if $\alpha(\tilde{y}) = \min_y \alpha(y)$.

The next simple lemma shows, among other things, that in order to attain 
a Nash equilibrium, both players must play worst-case optimal strategies. 


8.1.2 Lemma. 


(i) We have $\max_x \beta(x) \le \min_y \alpha(y)$. Actually, for every two
mixed strategies x and y we have $\beta(x) \le x^T M y \le \alpha(y)$.

(ii) If the pair $(\tilde{x},\tilde{y})$ of mixed strategies forms a mixed Nash
equilibrium, then both $\tilde{x}$ and $\tilde{y}$ are worst-case optimal.

(iii) If mixed strategies $\tilde{x}$ and $\tilde{y}$ satisfy
$\beta(\tilde{x}) = \alpha(\tilde{y})$, then they form a mixed Nash equilibrium.


Proof. It is an amusing mental exercise to try to “see” the claims of the 
lemma by thinking informally about players and games. But a formal proof 
is routine, which is a nice demonstration of the power of mathematical for- 
malism. 

The first sentence in (i) follows from the second one, which in turn is an
immediate consequence of the definitions of $\alpha$ and $\beta$.

In (ii), for any x we have $\beta(x) \le \alpha(\tilde{y})$ by (i), and since
$\beta(\tilde{x}) = \alpha(\tilde{y})$ (by Definition 8.1.1), we obtain
$\beta(x) \le \beta(\tilde{x})$. Thus $\tilde{x}$ is worst-case optimal, and a
symmetric argument shows the worst-case optimality of $\tilde{y}$. This proves
(ii).

As for (iii), if $\beta(\tilde{x}) = \alpha(\tilde{y})$, then by (i) we have
$\beta(\tilde{x}) = \tilde{x}^T M \tilde{y} = \alpha(\tilde{y})$, and hence
$(\tilde{x},\tilde{y})$ is a mixed Nash equilibrium. The lemma is proved.
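
The inequalities in part (i) are easy to check numerically. The following is a
minimal sketch, not part of the original text: it assumes the payoff matrix of
ordinary rock-paper-scissors (payoffs to Alice, the row player) and uses the
fact that a linear function on a simplex attains its minimum and maximum at a
vertex, so the best response can always be taken to be a pure strategy. The
function names `beta`, `alpha`, and `payoff` are our own.

```python
from fractions import Fraction

def beta(M, x):
    """Alice's worst-case expected payoff for mixed strategy x:
    the minimum of x^T M y over Bob's pure strategies."""
    return min(sum(x[i] * M[i][j] for i in range(len(M)))
               for j in range(len(M[0])))

def alpha(M, y):
    """Bob's worst-case expected loss for mixed strategy y:
    the maximum of x^T M y over Alice's pure strategies."""
    return max(sum(M[i][j] * y[j] for j in range(len(y)))
               for i in range(len(M)))

def payoff(M, x, y):
    """Expected payoff x^T M y."""
    return sum(x[i] * M[i][j] * y[j]
               for i in range(len(M)) for j in range(len(M[0])))

# Assumed example: rock, paper, scissors for both players.
M = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]
x = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]  # an arbitrary strategy
y = [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]  # the uniform strategy

# Lemma 8.1.2(i): beta(x) <= x^T M y <= alpha(y)
assert beta(M, x) <= payoff(M, x, y) <= alpha(M, y)
```

Exact rational arithmetic is used so that the comparisons are not affected by
rounding.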


Here is the main result of this section: 


8.1.3 Theorem (Minimax theorem for zero-sum games). For every
zero-sum game, worst-case optimal mixed strategies for both players exist
and can be efficiently computed by linear programming. If $\tilde{x}$ is a
worst-case optimal mixed strategy of Alice and $\tilde{y}$ is a worst-case
optimal mixed strategy of Bob, then $(\tilde{x},\tilde{y})$ is a mixed Nash
equilibrium, and the number
$\beta(\tilde{x}) = \tilde{x}^T M \tilde{y} = \alpha(\tilde{y})$ is the same
for all possible worst-case optimal mixed strategies $\tilde{x}$ and
$\tilde{y}$.


The value $\tilde{x}^T M \tilde{y}$, the expected payoff in any Nash
equilibrium, is called the value of the game. Together with Lemma 8.1.2(ii),
we get that $(\tilde{x},\tilde{y})$ forms a mixed Nash equilibrium if and only
if both $\tilde{x}$ and $\tilde{y}$ are worst-case optimal.

This theorem, in a sense, tells us everything about playing zero-sum 
games. In particular, “Prepare for the worst” is indeed the best policy (for 
nontrivial reasons!). If Alice plays a worst-case optimal mixed strategy, her 
expected payoff is always at least the value of the game, no matter what strat- 
egy Bob chooses. Moreover, if Bob is well informed and plays a worst-case 
optimal mixed strategy, then Alice cannot secure an expected payoff larger 
than the value of the game, no matter what strategy she chooses. So there 
are no secrets and no psychology involved; both players can as well declare 
their mixed strategies in advance, and nothing changes. 


Of course, if there are many rounds of the game and Alice suspects
that Bob hasn’t learned his lesson and doesn’t play optimally, she
can begin to contemplate how she could exploit this. Then psychology
does come into play. However, by trying strategies that are not
worst-case optimal, she is taking a risk, since she also gives Bob a
chance to exploit her.


It remains to explain the name “minimax theorem.” If we consider the
equality $\beta(\tilde{x}) = \alpha(\tilde{y})$ and use the definitions of
$\beta$, $\alpha$, and of worst-case optimality of $\tilde{x}$ and $\tilde{y}$,
we arrive at

\[ \max_x \min_y x^T M y = \min_y \max_x x^T M y, \]

and this is the explanation we offer.

The relation of Theorem 8.1.3 to Lemma 8.1.2(i) is similar to the relation 
of the duality theorem of linear programming to the weak duality theorem. 
And indeed, we are going to use the duality theorem in the proof of Theo- 
rem 8.1.3 in a substantial way. 


Proof of Theorem 8.1.3. We first show how worst-case optimal mixed
strategies $\tilde{x}$ for Alice and $\tilde{y}$ for Bob can be found by linear
programming. Then we prove that the desired equality
$\beta(\tilde{x}) = \alpha(\tilde{y})$ holds.

We begin by noticing that Bob’s best response to a fixed mixed strategy
x of Alice can be found by solving a linear program. That is, $\beta(x)$, with
x a concrete vector of m numbers, is the optimal value of the following linear
program in the variables $y_1,\ldots,y_n$:

\[
\begin{array}{ll}
\text{Minimize}   & x^T M y \\
\text{subject to} & \sum_{j=1}^{n} y_j = 1 \\
                  & y \ge \mathbf{0}.
\end{array}
\tag{8.1}
\]


So we can evaluate $\beta(x)$. But for finding a worst-case optimal strategy of
Alice we need to maximize $\beta$. Unfortunately, $\beta(x)$ is not a linear
function, so we cannot directly formulate the maximization of $\beta(x)$ as a
linear program. Fortunately, we can circumvent this issue by using linear
programming duality.

Using the dualization recipe from Section 6.2, we write down the dual of
(8.1):

\[
\begin{array}{ll}
\text{Maximize}   & x_0 \\
\text{subject to} & M^T x - \mathbf{1} x_0 \ge \mathbf{0}
\end{array}
\]

(this is a nice exercise in dualization). This dual linear program has only
one variable $x_0$, since $x_1,\ldots,x_m$ are still regarded as fixed numbers.
By the duality theorem, the optimal value of the dual linear program is the
same as that of the primal, namely, $\beta(x)$.

In order to maximize $\beta(x)$ over all mixed strategies x of Alice, we set up
a new linear program that looks exactly like the previous one, but in which
$x_1,\ldots,x_m$ are regarded as variables (this works only because the
constraints happen to be linear in $x_0, x_1,\ldots,x_m$):

\[
\begin{array}{ll}
\text{Maximize}   & x_0 \\
\text{subject to} & M^T x - \mathbf{1} x_0 \ge \mathbf{0} \\
                  & \sum_{i=1}^{m} x_i = 1 \\
                  & x \ge \mathbf{0}.
\end{array}
\tag{8.2}
\]


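If $(\tilde{x}_0, \tilde{x})$ denotes an optimal solution of this linear
program, we have by construction

\[ \tilde{x}_0 = \beta(\tilde{x}) = \max_x \beta(x). \tag{8.3} \]

In a symmetric fashion, we can derive a linear program for solving Bob’s
task of computing a best strategy $\tilde{y}$. We obtain the problem

\[
\begin{array}{ll}
\text{minimize}   & y_0 \\
\text{subject to} & M y - \mathbf{1} y_0 \le \mathbf{0} \\
                  & \sum_{j=1}^{n} y_j = 1 \\
                  & y \ge \mathbf{0}
\end{array}
\tag{8.4}
\]

in the variables $y_0, y_1,\ldots,y_n$. Now an optimal solution
$(\tilde{y}_0, \tilde{y})$ satisfies

\[ \tilde{y}_0 = \alpha(\tilde{y}) = \min_y \alpha(y). \tag{8.5} \]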


So both $\tilde{x}$ and $\tilde{y}$ are worst-case optimal strategies (and
conversely, worst-case optimal strategies provide optimal solutions of the
respective linear programs).

The punchline is that the two linear programs (8.2) and (8.4) are dual
to each other! Again, the dualization recipe shows this. It follows that both
programs have the same optimum value $\tilde{x}_0 = \tilde{y}_0$. Hence
$\beta(\tilde{x}) = \alpha(\tilde{y})$ and $(\tilde{x},\tilde{y})$ is a Nash
equilibrium by Lemma 8.1.2(iii).


Rock-Paper-Scissors revisited. To kill the time between their rare public 
appearances, Santa Claus and the Easter Bunny play the rock-paper-scissors 
game against each other. The Easter Bunny, however, cannot indicate a pair 
of scissors with his paw and is therefore limited to two pure strategies. The 
payoff matrix in this variant is 


                rock    paper

    rock          0      −1
    paper         1       0
    scissors     −1       1


We already see that Santa Claus should never play rock: For any possible 
gesture of the Easter Bunny, paper is a better strategy. 

Let us apply the machinery we have just developed to find optimal mixed
strategies for Santa Claus and the Easter Bunny. Recall that Santa Claus
has to solve the linear program (8.2) to find the probability distribution
$\tilde{x} = (\tilde{x}_1, \tilde{x}_2, \tilde{x}_3)$ that determines his
optimal strategy. At the same time, he will compute the game value
$\tilde{x}_0$, his expected gain.

The linear program is

\[
\begin{array}{lrcl}
\text{maximize}   & x_0 \\
\text{subject to} & x_2 - x_3 - x_0 & \ge & 0 \\
                  & -x_1 + x_3 - x_0 & \ge & 0 \\
                  & x_1 + x_2 + x_3 & = & 1 \\
                  & x_1, x_2, x_3 & \ge & 0.
\end{array}
\]

A (unique) optimal solution is
$(\tilde{x}_0, \tilde{x}_1, \tilde{x}_2, \tilde{x}_3) = (\frac13, 0, \frac23, \frac13)$.
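The Easter Bunny’s problem (8.4) is

\[
\begin{array}{lrcl}
\text{minimize}   & y_0 \\
\text{subject to} & -y_2 - y_0 & \le & 0 \\
                  & y_1 - y_0 & \le & 0 \\
                  & -y_1 + y_2 - y_0 & \le & 0 \\
                  & y_1 + y_2 & = & 1 \\
                  & y_1, y_2 & \ge & 0.
\end{array}
\]

A (unique) optimal solution is
$(\tilde{y}_0, \tilde{y}_1, \tilde{y}_2) = (\frac13, \frac13, \frac23)$.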


Let us summarize: If both play optimally, Santa Claus wins 1/3 on average
(this is a scientific explanation of why Santa Claus can afford to bring more
presents!). Both play paper with probability 2/3. With the remaining
probability 1/3, Santa Claus plays scissors, while the Easter Bunny plays rock.
This result is a simple but still nontrivial application of zero-sum game
theory.
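
As a sanity check, the following sketch (our own, with exact rational
arithmetic) verifies that Santa Claus playing (rock, paper, scissors) with
probabilities (0, 2/3, 1/3) and the Easter Bunny playing (rock, paper) with
probabilities (1/3, 2/3) give $\beta(\tilde{x}) = \alpha(\tilde{y}) = 1/3$. By
Lemma 8.1.2(iii) this certifies a Nash equilibrium without solving any linear
program. The variable names are illustrative only.

```python
from fractions import Fraction as F

# Rows: Santa's rock, paper, scissors; columns: the Bunny's rock, paper.
M = [[0, -1],
     [1, 0],
     [-1, 1]]

x = [F(0), F(2, 3), F(1, 3)]   # Santa Claus
y = [F(1, 3), F(2, 3)]         # Easter Bunny

# beta(x): Santa's expected payoff against each pure reply of the Bunny,
# then the minimum (the Bunny's best response).
santa_vs_pure = [sum(x[i] * M[i][j] for i in range(3)) for j in range(2)]
beta_x = min(santa_vs_pure)

# alpha(y): expected payoff of each pure strategy of Santa against y,
# then the maximum (Santa's best response).
pure_vs_bunny = [sum(M[i][j] * y[j] for j in range(2)) for i in range(3)]
alpha_y = max(pure_vs_bunny)

# Equality certifies a Nash equilibrium and gives the value of the game.
assert beta_x == alpha_y == F(1, 3)
```

The list `pure_vs_bunny` also shows why rock is never played by Santa Claus:
against the Bunny's optimal strategy it yields a strictly smaller payoff than
paper or scissors.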


In retrospect, the original rock-paper-scissors game might appear
rather boring, but this is relative: There is a World RPS Society
(http://www.worldrps.com/) that holds an annual rock-paper-scissors
world championship and sells a book on how to play the game.


Choosing numbers. Here is another game, which actually seems fun to play 
and in which the optimal mixed strategies are not at all obvious. Each of the 
two players independently writes down an integer between 1 and 6. Then the 
numbers are compared. If they are equal, the game is a draw. If the numbers 
differ by one, the player with the smaller number gets €2 from the one with 
the larger number. If the two numbers differ by two or more, the player with 
the larger number gets €1 from the one with the smaller number. We want 
to challenge the reader to compute the optimal mixed strategies for this game 
(by symmetry, they are the same for both players). 
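
Without spoiling the challenge, here is a small sketch of how the payoff matrix
of this game could be set up before handing it to the linear program (8.2). The
function name `choosing_numbers_matrix` is our own; entry (a, b) is the payoff
to the player who wrote a when the opponent wrote b.

```python
def choosing_numbers_matrix(k=6):
    """Payoff matrix (to the row player) of the number-choosing game
    with numbers 1..k."""
    M = [[0] * k for _ in range(k)]
    for a in range(1, k + 1):
        for b in range(1, k + 1):
            if a == b:
                value = 0                      # draw
            elif abs(a - b) == 1:
                value = 2 if a < b else -2     # smaller number wins 2
            else:
                value = 1 if a > b else -1     # larger number wins 1
            M[a - 1][b - 1] = value
    return M

M = choosing_numbers_matrix()

# The game is symmetric: M^T = -M.
assert all(M[i][j] == -M[j][i] for i in range(6) for j in range(6))
```

The antisymmetry check confirms the symmetry mentioned in the text, which is
why both players have the same optimal mixed strategies.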


The Colonel Blotto game was apparently first considered by the
mathematician Émile Borel in 1921 (he also served as a French minister
of the navy, although only for several months). It can be considered
for any number of regiments; for 6 regiments, there is still a Nash
equilibrium defined by a pair of pure strategies, but for 7 regiments
(this is the case stated in Borel’s original paper) it becomes necessary
to mix. Borel’s paper does not mention the name “Colonel Blotto”; this
name appears in Hubert Phillips’s Week-end Problems Book, a collection
of puzzles from 1933.

In his paper, Borel considers symmetric games in general. A symmetric
game is defined by a payoff matrix M with $M^T = -M$. Borel
erroneously states that if the number n of strategies is sufficiently 
large, one can construct symmetric games in which each player can 
secure a positive expected payoff, knowing the other player’s mixed 
strategy. He concludes that playing zero-sum games requires psychol- 
ogy, on top of mathematics. 

It was only in 1926 that John von Neumann, not knowing about 
Borel’s work and its pessimistic conclusion, formally established The- 
orem 8.1.3. 


Bimatrix games. An important generalization of finite zero-sum
games is bimatrix games, in which both Alice and Bob want to maximize
the payoff with respect to a payoff matrix of their own, A for
Alice, and B for Bob (in the zero-sum case, A = −B). A bimatrix
game also has at least one mixed Nash equilibrium: a pair of strategies
$\tilde{x}, \tilde{y}$ that are best responses against each other, meaning
that $\tilde{x}^T A \tilde{y} = \max_x x^T A \tilde{y}$ and
$\tilde{x}^T B \tilde{y} = \max_y \tilde{x}^T B y$. We encourage the
reader to find a mixed Nash equilibrium in the following variant of
the modified rock-paper-scissors game played by Santa Claus and the
Easter Bunny: As before, the loser pays €1 to the winner, but in case
of a draw, each player donates €0.50 to charity.

The problem of finding a Nash equilibrium in a bimatrix game 
cannot be formulated as a linear program, and no polynomial-time 
algorithm is known for it. On the other hand, a Nash equilibrium can 
be computed by a variant of the simplex method, called the Lemke- 
Howson algorithm (but possibly with exponentially many pivot steps). 

In general, Nash equilibria in bimatrix games are not as satisfactory 
as in zero-sum games, and there is no such thing as “the” game value. 
We know that in a Nash equilibrium, no player has an incentive to 
unilaterally switch to a different behavior. Yet, it may happen that 
both can increase their payoff by switching simultaneously, a situation 
that obviously cannot occur in a zero-sum game. This means that 
the Nash equilibrium was not optimal from the point of view of social 
welfare, and no player has a real desire of being in this particular Nash 
equilibrium. It may even happen that all Nash equilibria are of this 
suboptimal nature. Here is an example. 

At each of the two department stores in town, All the Best Deals
and Buyer’s Paradise, the respective owner needs to decide whether to
launch an advertisement campaign for the upcoming Christmas sale. If
one store runs a campaign while the competitor doesn’t, the expected
extra revenue obtained from customers switching their preferred store
(€50,000, say) and new customers (€10,000) easily outweighs the
cost of the campaign (€20,000). If, on the other hand, both stores
advertise, let us assume that the campaigns more or less neutralize
each other, with extra revenue coming only from new customers in an
almost saturated market (€8,000 for each of the stores).

Listing the net revenues $a_{ij}, b_{ij}$ as pairs, in units of €1,000, we
obtain the following matrix, with rows corresponding to the strategies
of All the Best Deals, and columns to Buyer’s Paradise.

                       advertise      don’t advertise

    advertise          (−12, −12)     (40, −50)
    don’t advertise    (−50, 40)      (0, 0)


If the store owners were friends, they might agree on running no 
campaign in order to save money (they’d better keep this agreement 
private in order to avoid a price-fixing charge). But if they do not 
communicate or mistrust each other, rational behavior will force both 
of them to waste money on campaigns. To see this, put yourself in the 
position of one of the store owners. Assuming that your competitor 
will not advertise, you definitely want to advertise in order to profit 
from the extra revenue. Assuming that the other store will advertise, 
you must advertise as well, in order not to lose customers. This means 
that you will advertise in any case. We say that the strategy “adver- 
tise” strictly dominates the strategy “don’t advertise,” so it would be 
irrational to play the latter. 

Because the other store owner reaches the same conclusion, there 
will be two almost useless campaigns in the end. In fact, the pair of 
strategies (advertise, advertise) is the unique Nash equilibrium of the 
game (mixing does not help), but it is suboptimal with respect to 
social welfare. 
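
For a game this small, the claims about dominance and pure equilibria can be
checked by direct enumeration. The sketch below is our own illustration (the
helper names are hypothetical); A holds the payoffs of All the Best Deals (row
player) and B those of Buyer's Paradise (column player), in units of €1,000,
with index 0 meaning "advertise" and 1 meaning "don't advertise". It only
examines pure strategies, so it does not by itself rule out mixed equilibria.

```python
A = [[-12, 40],
     [-50, 0]]    # row player's payoffs
B = [[-12, -50],
     [40, 0]]     # column player's payoffs

def pure_nash_equilibria(A, B):
    """All pairs (i, j) of pure strategies that are mutual best responses."""
    rows, cols = len(A), len(A[0])
    result = []
    for i in range(rows):
        for j in range(cols):
            row_is_best = all(A[i][j] >= A[k][j] for k in range(rows))
            col_is_best = all(B[i][j] >= B[i][l] for l in range(cols))
            if row_is_best and col_is_best:
                result.append((i, j))
    return result

def row_strictly_dominates(A, i, k):
    """True if row strategy i is strictly better than k against every column."""
    return all(A[i][j] > A[k][j] for j in range(len(A[0])))

def col_strictly_dominates(B, j, l):
    """True if column strategy j is strictly better than l against every row."""
    return all(B[i][j] > B[i][l] for i in range(len(B)))

equilibria = pure_nash_equilibria(A, B)
```

Here `equilibria` contains only the pair (0, 0), that is, (advertise,
advertise), and "advertise" strictly dominates "don't advertise" for both
stores.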

In general bimatrix games, the players might not be able to reach 
the social optimum through rational reasoning, even if this optimum 
corresponds to an equilibrium of the game. This is probably the most 
serious deficiency of bimatrix games as models of real-world situations. 
An example is the battle of the sexes. A couple wants to go out at 
night. He prefers the boxing match, while she prefers the opera, but 
both prefer going together over going alone. If both are to decide 
independently where to go, there is no rational way of reaching a 
social optimum (namely both going out together, no matter where). 

When the advertisement game is played repeatedly (after all, there
is a Christmas sale every year), the situation changes. In the long run,
wasting money every year is such a bad prospect that the following
more cooperative behavior makes sense: In the first year, refrain from
advertising, and in later years just do what the competitor did the
year before. This strategy is known as TIT FOR TAT. If both stores
adopt this policy, they will never waste money on campaigns; but even
if one store deviates from it, there is no possibility of exploiting the
competitor in the long run. It is easy to see that after a possible first
loss, the one playing TIT FOR TAT can “pay” for any further loss,
due to a previous loss of the competitor.
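
The last claim can be illustrated by a small simulation, given here as a sketch
under our own conventions ('A' for advertise, 'N' for don't advertise, payoffs
in units of €1,000 taken from the matrix above). It plays TIT FOR TAT against
every possible opponent behavior over eight years and records by how much the
opponent can end up ahead.

```python
from itertools import product

# PAYOFF[(mine, theirs)] is my net revenue in one year.
PAYOFF = {('A', 'A'): -12, ('A', 'N'): 40,
          ('N', 'A'): -50, ('N', 'N'): 0}

def play_against_tit_for_tat(opponent_moves):
    """Return (total of the TIT FOR TAT store, total of the opponent)."""
    tft_total = opp_total = 0
    tft_move = 'N'                      # first year: refrain from advertising
    for opp_move in opponent_moves:
        tft_total += PAYOFF[(tft_move, opp_move)]
        opp_total += PAYOFF[(opp_move, tft_move)]
        tft_move = opp_move             # next year: copy the competitor
    return tft_total, opp_total

# Largest lead any opponent behavior achieves over 8 years.
worst_gap = max(opp - tft
                for tft, opp in map(play_against_tit_for_tat,
                                    product('AN', repeat=8)))
```

In this model the opponent's lead never exceeds 90, the effect of a single year
of being exploited (40 versus −50), no matter how long the game lasts: every
later loss of the TIT FOR TAT store is preceded by an equal loss of the
competitor.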


The Prisoner’s Dilemma. The advertisement game is a variation of 
the well-known prisoner’s dilemma, in which two convicts charged 
with a crime committed together are independently offered (somewhat 
unethical) plea bargains; if both stay silent, a lack of evidence will lead 
to only minor punishment. If each testifies against the other, there will 
be some bearable punishment. But if—and this is the unethical part— 
exactly one of the two testifies against the other, the betrayer will be 
rewarded and set free, while the one that remains silent will receive 
a heavy penalty. As before, rational behavior will force both convicts 
to testify against each other, even if they had nothing to do with the 
crime. 

A popular introduction to the questions surrounding the prisoner’s 
dilemma is 


W. Poundstone: Prisoner’s Dilemma: John von Neumann, 
Game Theory, and the Puzzle of the Bomb, Doubleday, New 
York 1992 


(and the 1964 Stanley Kubrick movie Dr. Strangelove or: How I 
Learned to Stop Worrying and Love the Bomb might also widen one’s 
horizons in this context). 

A general introduction to game theory is 


J. Bewersdorff: Luck, Logic, and White Lies, AK Peters, 
Wellesley 2004. 


This book also contains the references to the work by Borel and von 
Neumann that we have mentioned above. 


8.2 Matchings and Vertex Covers in Bipartite 
Graphs 


Let us return to the job assignment problem from Section 3.2. There, the 
human resources manager of a company was confronted with the problem of 
filling seven positions with seven employees, where every employee has a score 
(reflecting qualification) for each position he or she is willing to accept. We 
found that the manager can use linear programming to find an assignment 
of employees to positions (a perfect matching) that maximizes the sum of
scores. We have also promised to show how the manager can fill the positions
optimally if there are more employees than positions.

Here we will first disregard the scores and ask under what conditions 
the task has any solution at all, that is, whether there is any assignment of 
positions to employees such that every employee gets a position she or he is 
willing to accept and each position is filled. We know that we can decide this 
by linear programming (we just invent some arbitrary scores, say all scores 
100), but here we ask for a mathematical condition. 

For example, let us consider the graph from Section 3.2 but slightly mod- 
ified (some of the people changed their minds): 


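[Figure: bipartite graph with the people A, B, C, D, E, F, G in the top row
and the jobs q, r, s, t, u, v, w in the bottom row; the jobs r, s, t, u, w
are marked.]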


There is no assignment filling all positions in this situation. This is not imme- 
diately obvious, but it becomes obvious once we look at the set {r,s,t,u, w} 
of jobs (marked). Indeed, the set of people willing to take any of these 5 
jobs is {C, D, E,G}. This is only 4 people, and so they cannot be assigned 
to 5 different jobs. 

The next theorem, known as Hall’s theorem or the marriage theorem,
states that if no assignment exists, we can always find such a simple “reason”:
a subset of k jobs such that the total number of employees willing to take
any of them is smaller than k.

Before we formally state and prove this theorem, in the language of bi- 
partite graphs, we need to recall the notions of maximum matching (Section 
3.2) and minimum vertex cover (Section 3.3). 

A matching in a graph G = (V, E) is a set $E' \subseteq E$ of edges with the
property that each vertex is incident to at most one edge in $E'$. A matching
is maximum if it has the largest number of edges among all matchings in G.

A vertex cover of G = (V, E) is a set $V' \subseteq V$ of vertices with the
property that each edge is incident to at least one vertex in $V'$. A vertex
cover is minimum if it has the smallest number of vertices among all vertex
covers of G.

Hall’s theorem gives necessary and sufficient conditions for the existence
of a best possible matching in a bipartite graph, namely a matching that
covers all vertices in one class of the vertex bipartition.

8.2.1 Theorem (Hall’s theorem). Let G = (V, E) be a bipartite graph
with bipartition $V = X \cup Y$. For a set $T \subseteq X$, we define the
neighborhood $N(T) \subseteq Y$ of T as the set

\[ N(T) = \{ w \in Y : \{v,w\} \in E \text{ for some } v \in T \}. \]

If for every $T \subseteq X$, $|N(T)| \ge |T|$ holds, then G has a matching
that covers all vertices in X.
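
For small graphs, Hall's condition can be tested by brute force over all
subsets T of X. The sketch below is our own illustration, exponential in |X|
and meant only for experimenting; the function name `hall_violator` and both
example graphs are hypothetical (they are not the graph from the figure above).

```python
from itertools import combinations

def hall_violator(X, neighbors):
    """Return a pair (T, N(T)) with |N(T)| < |T|, or None if
    |N(T)| >= |T| holds for every nonempty T contained in X.

    neighbors[v] is the set of vertices in Y adjacent to v."""
    for size in range(1, len(X) + 1):
        for T in combinations(X, size):
            N = set().union(*(neighbors[v] for v in T))
            if len(N) < len(T):
                return set(T), N
    return None

# Hypothetical example: jobs r and s are both acceptable only to C.
jobs = ['r', 's', 't']
willing = {'r': {'C'}, 's': {'C'}, 't': {'C', 'D'}}
violator = hall_violator(jobs, willing)

# Hypothetical example in which Hall's condition holds.
willing_ok = {'r': {'C'}, 's': {'D'}, 't': {'C', 'E'}}
no_violator = hall_violator(jobs, willing_ok)
```

In the first example the search returns the two jobs r and s together with
their single common candidate C, exactly the kind of simple "reason" for the
nonexistence of an assignment described in the text.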


Actually, we derive the following statement, from which Hall’s theorem 
easily follows. 


8.2.2 Theorem (König’s theorem). Let G = (V, E) be a bipartite graph.
Then the size of a maximum matching in G equals the size of a minimum
vertex cover of G.


We prove this theorem using the duality of linear programming. There
are also combinatorial proofs, and they are actually simpler than the proof
offered here. There are at least two reasons in favor of the proof via duality:
First, it is a simple example illustrating a powerful general technique. And
second, it gives more than just König’s theorem. It shows that a maximum
matching, as well as a minimum vertex cover, in a bipartite graph can be
computed by linear programming. Moreover, the method can be extended to
computing a maximum-weight matching in a bipartite graph with weights on
the edges.

Let us first see how Hall’s theorem follows from König’s theorem.


Proof of Theorem 8.2.1. Let $G = (X \cup Y, E)$ be a bipartite graph with
$|N(T)| \ge |T|$ for all $T \subseteq X$. We will show that any minimum vertex
cover of G has size $n_1 = |X|$. König’s theorem then implies that G has a
matching of size $n_1$, and this matching obviously covers X.

Since X itself is a vertex cover, a minimum vertex cover has at most $n_1$
vertices. For contradiction, suppose that there is a vertex cover C with k
vertices from X and fewer than $n_1 - k$ vertices from Y, for some k. The set
$T = X \setminus C$ has size $n_1 - k$ and satisfies $|N(T)| \ge n_1 - k$ by
the assumption. But this implies that there is a vertex $w \in N(T)$ that is
not in $C \cap Y$. Since this vertex has some neighbor $v \in T$, the edge
$\{v,w\}$ is not covered by C, a contradiction.


Totally unimodular matrices. A matrix A is called totally unimodular 
if every square submatrix of A (obtained from A by deleting some rows and 
some columns) has determinant 0, 1, or —1. We note that, in particular, the 
entries of A can be only 0, —1, and +1. 

Such matrices are interesting since an integer program with a totally
unimodular constraint matrix is easy to solve: it suffices to solve its LP
relaxation, as we will show in Lemma 8.2.4 below. Let us start with a
preparatory step.

8.2.3 Lemma. Let A be a totally unimodular matrix, and consider the matrix
$\bar{A}$ obtained from A by appending a unit vector $e_i$ as a new last
column. Then $\bar{A}$ is totally unimodular as well.


Proof. Let us fix an $\ell \times \ell$ submatrix Q of $\bar{A}$. If Q is a
submatrix of A, then $\det(Q) \in \{-1,0,1\}$ by total unimodularity of A.
Otherwise, we pick the column j of Q that corresponds to the newly added
column in $\bar{A}$, and we compute $\det(Q)$ according to the Laplace
expansion on this column. We recall that

\[ \det(Q) = \sum_{i=1}^{\ell} (-1)^{i+j} q_{ij} \det(Q^{ij}), \]

where $Q^{ij}$ is the matrix resulting from Q by removing row i and column j.
By construction, the jth column may be 0 (and we get $\det(Q) = 0$), or there
is exactly one nonzero entry $q_{kj} = 1$. In that case,

\[ \det(Q) = (-1)^{k+j} \det(Q^{kj}) \in \{-1,0,1\}, \]

since $Q^{kj}$ is a submatrix of A.


The following lemma is the basis of using linear programming for solving 
integer programs with totally unimodular matrices. 


8.2.4 Lemma. Let us consider a linear program with n nonnegative variables
and m inequalities of the form

\[
\begin{array}{ll}
\text{maximize}   & c^T x \\
\text{subject to} & Ax \le b \\
                  & x \ge \mathbf{0},
\end{array}
\]

where $b \in \mathbb{Z}^m$. If A is totally unimodular, and if the linear
program has an optimal solution, then it also has an integral optimal solution
$x^* \in \mathbb{Z}^n$.


Proof. We first transform the linear program into equational form. The
resulting system of equality constraints is $\bar{A}\bar{x} = b$, with
$\bar{A} = (A \mid I_m)$ and $\bar{x} \in \mathbb{R}^{n+m}$. Then we solve the
problem using the simplex method. Let $\bar{x}^*$ be an optimal basic feasible
solution, associated with the feasible basis
$B \subseteq \{1,2,\ldots,n+m\}$. Then we know that the nonzero entries of
$\bar{x}^*$ are given by

\[ \bar{x}^*_B = \bar{A}_B^{-1} b; \]

see Section 5.5.

By Cramer’s rule, the entries of $\bar{A}_B^{-1}$ can be written as rational
numbers with common denominator $\det(\bar{A}_B)$. The matrix $\bar{A}_B$ is a
square submatrix of $\bar{A}$, where $\bar{A}$ is totally unimodular (by
repeated application of Lemma 8.2.3). Since $\bar{A}_B$ is nonsingular, we get
$\det(\bar{A}_B) \in \{-1,1\}$, and the integrality of $\bar{x}^*$ follows
(recall that b is integral).

Incidence matrices of bipartite graphs. Here is the link between total
unimodularity and König’s theorem. Let $G = (X \cup Y, E)$ be a bipartite graph
with n vertices $v_1,\ldots,v_n$ and m edges $e_1,\ldots,e_m$. The incidence
matrix of G is the matrix $A \in \mathbb{R}^{n \times m}$ defined by

\[
a_{ij} = \begin{cases} 1 & \text{if } v_i \in e_j \\ 0 & \text{otherwise.} \end{cases}
\]


8.2.5 Lemma. Let G = (X U Y,E) be a bipartite graph. The incidence 
matrix A of G is totally unimodular. 


Proof. We need to prove that every $\ell \times \ell$ submatrix Q of A has
determinant 0 or ±1, and we proceed by induction on $\ell$. The case
$\ell = 1$ is immediate, since the entries of an incidence matrix are only 0’s
and 1’s.

Now we consider $\ell > 1$ and an $\ell \times \ell$ submatrix Q. Since the
columns of Q correspond to edges, each column of Q has at most two nonzero
entries (which are 1). If there is a column with only zero entries, we get
$\det(Q) = 0$, and if there is a column with only one nonzero entry, we can
expand the determinant on this column (as in the proof of Lemma 8.2.3) and get
that up to sign, $\det(Q)$ equals the determinant of an
$(\ell-1) \times (\ell-1)$ submatrix $Q'$. By induction,
$\det(Q') \in \{-1,0,1\}$, so the same holds for Q.

Finally, if every column of Q contains precisely two 1’s, we claim that
$\det(Q) = 0$. To see this, we observe that the sum of all rows of Q
corresponding to vertices in X is the row vector (1,...,1), since for each
column of Q, exactly one of its two 1’s comes from a vertex in X. For the same
reason, we get (1,...,1) by summing up the rows for vertices in Y, and it
follows that the rows of Q are linearly dependent: the sum of the rows for X
minus the sum of the rows for Y is the zero vector.
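
The lemma can be checked experimentally on small graphs. The following sketch
is our own brute-force test of total unimodularity (it examines every square
submatrix, so it is only suitable for tiny matrices); the helper names and the
two example graphs are hypothetical. It also shows that bipartiteness matters:
the incidence matrix of a triangle has determinant 2.

```python
from fractions import Fraction
from itertools import combinations

def det(Q):
    """Determinant by Gaussian elimination over exact fractions."""
    n = len(Q)
    A = [[Fraction(v) for v in row] for row in Q]
    d = Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if A[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            A[c], A[pivot] = A[pivot], A[c]
            d = -d                       # a row swap flips the sign
        d *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
    return d

def is_totally_unimodular(A):
    """Check that every square submatrix has determinant -1, 0, or 1."""
    rows, cols = len(A), len(A[0])
    for size in range(1, min(rows, cols) + 1):
        for R in combinations(range(rows), size):
            for C in combinations(range(cols), size):
                if det([[A[i][j] for j in C] for i in R]) not in (-1, 0, 1):
                    return False
    return True

def incidence_matrix(vertices, edges):
    """Rows are vertices, columns are edges; entry 1 if the vertex is in the edge."""
    return [[1 if v in e else 0 for e in edges] for v in vertices]

# A bipartite 4-cycle with classes {a, b} and {c, d}.
square = incidence_matrix('abcd', [('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd')])
# A triangle, which is not bipartite.
triangle = incidence_matrix('abc', [('a', 'b'), ('b', 'c'), ('a', 'c')])
```

For the 4-cycle the test succeeds, while for the triangle it fails because the
full 3 × 3 incidence matrix has determinant 2.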


Now we are ready to prove König’s theorem.


Proof of Theorem 8.2.2. We first consider the integer program

\[
\begin{array}{ll}
\text{maximize}   & \sum_{j=1}^{m} x_j \\
\text{subject to} & Ax \le \mathbf{1} \\
                  & x \ge \mathbf{0} \\
                  & x \in \mathbb{Z}^m,
\end{array}
\]

where A is the incidence matrix of G. In this integer program, the row of A
corresponding to vertex $v_i$ induces the constraint

\[ \sum_{j:\, e_j \ni v_i} x_j \le 1. \]

This implies that $x_j \in \{0,1\}$ for all j, and that the edges $e_j$ with
$\tilde{x}_j = 1$ in an optimal solution $\tilde{x}$ form a maximum matching
in G.
Next we consider the integer program

\[
\begin{array}{ll}
\text{minimize}   & \sum_{i=1}^{n} y_i \\
\text{subject to} & A^T y \ge \mathbf{1} \\
                  & y \ge \mathbf{0} \\
                  & y \in \mathbb{Z}^n,
\end{array}
\]

where A is as before the incidence matrix of G. In this integer program, the
row of $A^T$ corresponding to edge $e_j$ induces the constraint

\[ \sum_{i:\, v_i \in e_j} y_i \ge 1. \]

This implies that in any optimal solution $\tilde{y}$ we have
$\tilde{y}_i \in \{0,1\}$ for all i, since any larger value could be decreased
to 1. But then the vertices $v_i$ with $\tilde{y}_i = 1$ in an optimal solution
$\tilde{y}$ form a minimum vertex cover of G.

To summarize, the optimum value of the first integer program is the size 
of a maximum matching in G, and the optimum value of the second integer 
program is the size of a minimum vertex cover in G. 

In both integer programs, we may now drop the integrality constraints
without affecting the optimum values: A, and therefore also $A^T$, are totally
unimodular by Lemma 8.2.5, and so Lemma 8.2.4 applies. But the resulting
linear programs are dual to each other; the duality theorem thus shows that
their optimal values are equal, and this proves Theorem 8.2.2.
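
König's theorem can be observed on small examples. The sketch below is our own
illustration and deliberately does not use linear programming: it finds a
maximum matching by the standard augmenting-path method and a minimum vertex
cover by brute force, and compares the two sizes. The graph and all names are
hypothetical.

```python
from itertools import combinations

def maximum_matching(X, adj):
    """Augmenting-path algorithm; adj[u] lists the neighbors in Y of u in X.
    Returns a set of edges (u, w)."""
    match_of = {}                      # vertex of Y -> its partner in X

    def try_augment(u, seen):
        for w in adj[u]:
            if w in seen:
                continue
            seen.add(w)
            # w is free, or its current partner can be re-matched elsewhere
            if w not in match_of or try_augment(match_of[w], seen):
                match_of[w] = u
                return True
        return False

    for u in X:
        try_augment(u, set())
    return {(u, w) for w, u in match_of.items()}

def minimum_vertex_cover(vertices, edges):
    """Smallest vertex set touching every edge, by brute force."""
    for size in range(len(vertices) + 1):
        for C in combinations(vertices, size):
            if all(u in C or w in C for u, w in edges):
                return set(C)

# Hypothetical graph: A accepts q or r, while B and C accept only q.
X = ['A', 'B', 'C']
adj = {'A': ['q', 'r'], 'B': ['q'], 'C': ['q']}
edges = [(u, w) for u in X for w in adj[u]]

matching = maximum_matching(X, adj)
cover = minimum_vertex_cover(X + ['q', 'r'], edges)
assert len(matching) == len(cover)
```

In this example both sizes are 2, with the matching {A-r, B-q} and the cover
{A, q}.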


It remains to explain the algorithmic implications of the proof (namely, 
how a maximum matching and a minimum vertex cover can actually be 
computed). To get a maximum matching, we simply need to find an integral 
optimal solution of the first linear program. When we use the simplex method 
to solve the linear program, we get this for free; see the proof of Lemma 8.2.4 
and, in particular, the claim toward the end of its proof. Otherwise, we can 
apply Theorem 4.2.3(ii) to construct a basic feasible solution from any given 
optimal solution, and this basic feasible solution will be integral. A minimum 
vertex cover is obtained from the second (dual) linear program in the same 
fashion. 

The previous arguments show more: Given edge weights $w_1,\ldots,w_m$, any
optimal solution of the integer program

\[
\begin{array}{ll}
\text{maximize}   & \sum_{j=1}^{m} w_j x_j \\
\text{subject to} & Ax \le \mathbf{1} \\
                  & x \ge \mathbf{0} \\
                  & x \in \mathbb{Z}^m
\end{array}
\]

corresponds to a maximum-weight matching in G. Since we can, as before,
relax the integrality constraints without affecting the optimum value, an
integral optimal solution of the relaxation can be found, and it yields a
maximum-weight matching in G. This solves the optimal job assignment problem
if there are more employees than jobs.

The fact that a linear program with totally unimodular constraint
matrix and integral right-hand side has an integral optimal solution
implies something much stronger: Since every vertex of the feasible
region is optimal for some objective function (see Section 4.4), we
know that all vertices of the feasible region are integral. We say that
the feasible region forms an integral polyhedron.

Such integrality results together with linear programming duality
can yield interesting and powerful minimax theorems. König’s theorem
is one such example. Another classical minimax theorem that can be
proved along these lines is the max-flow-min-cut theorem.

To state this theorem, we consider a network modeled by a directed
graph G = (V, E) with edge capacities $w_e$. In Section 2.2, we have
interpreted them as maximum transfer rates of data links. Given two
designated vertices, the source s and the sink t, the maximum flow
value is the maximum rate at which data can flow from s to t through
the network.

The minimum cut value, on the other hand, is the minimum total 
capacity of any set of data links whose breakdown disconnects t from s. 

The max-flow-min-cut theorem states that the maximum flow 
value is equal to the minimum cut value. One of several proofs writes 
both values as optimal values of linear programs that are dual to each 
other. 

When we consider matchings and vertex covers in general (not nec- 
essarily bipartite) graphs, the situation changes: Total unimodularity 
no longer applies, and the “duality” between the two concepts disap- 
pears. 

In fact, the problem of finding a minimum vertex cover in a gen- 
eral graph is computationally difficult (NP-hard); see Section 3.3. 
A maximum-weight matching, on the other hand, can still be com- 
puted in polynomial time for general graphs, although this result is 
by no means trivial. Behind the scenes, there is again an integrality 
result, based on the notion of total dual integrality; see the glossary. 


8.3 Machine Scheduling 


In the back office of the copy shop Copy & Paste, the operator is confronted
with n copy jobs submitted by customers the night before. For processing
them, she has m photocopying machines with different features at her
disposal. For all i, j, the operator quickly estimates how long it would take
the ith machine to process the jth job, and she makes a table of the resulting
running times, like this:

[Table: running times of the five jobs (rows) on the three machines Single
B&W, Duplex B&W, and Duplex Color (columns). Only fragments of the table
survive in the source, such as the row label “Party platform, 10 pages
two-sided, 5,000 color copies”; the individual entries are not recoverable.]

Since the operator can go home as soon as all jobs have been processed, 
her goal is to find an assignment of jobs to machines (a schedule) such that 
the makespan—the time needed to finish all jobs—is minimized. In our 
example, this is not hard to figure out: For the party platform, there is no 
choice between machines. We can also observe that it is advantageous to 
use both B&W machines for processing the two flyers, no matter where the 
thesis and the obituary go. Given this, the makespan is at least 4h 55 min if 
the thesis is processed on the Duplex B&W machine, so it is better put on 
the color machine to achieve the optimum makespan of 4h 30 min (with the 
obituary running on the B&W machine). 

In general, finding the optimum makespan is computationally difficult 
(NP-hard). The obvious approach of trying all possible schedules is of course 
not a solution for a larger number n of jobs. What we show in this section is 
that the operator can quickly compute a schedule whose makespan is at most 
twice as long as the optimum makespan. All she needs for that are some linear 
programming skills. (To really appreciate this result, one shouldn’t think of 
a problem with 5 jobs but with thousands of jobs.) 

We should emphasize that in this scheduling problem the jobs are consid- 
ered indivisible, and so each job must be processed on a single machine. This, 
in a sense, is what makes the problem difficult. As we will soon see, an opti- 
mal “fractional schedule,” where a single job could be divided among several 
machines in arbitrary ratios, can be found efficiently by linear programming. 


Two integer programs for the scheduling problem. Let us identify 
the m machines with the set M := {1,...,m} and the n jobs with the set 
J:={m+1,...,m+n}. 

Let $d_{ij}$ denote the running time of job $j \in J$ on machine $i \in M$. We
assume $d_{ij} \ge 0$. To simplify notation, we also assume that any machine
can process any job: An infeasible assignment of job $j$ to machine $i$ can be
modeled by a large running time $d_{ij} = K$. If $K$ is larger than the sum of all
“real” running times, the optimal schedule will avoid infeasible assignments, 
given that there is a feasible schedule at all. 
With these notions, the following integer program in the variables $t$ and
$x_{ij}$, $i \in M$, $j \in J$, computes an assignment of jobs to machines that minimizes
the makespan:

\[
\begin{array}{lll}
\text{Minimize} & t \\
\text{subject to} & \sum_{i \in M} x_{ij} = 1 & \text{for all } j \in J \\
 & \sum_{j \in J} d_{ij} x_{ij} \le t & \text{for all } i \in M \\
 & x_{ij} \ge 0 & \text{for all } i \in M,\ j \in J \\
 & x_{ij} \in \mathbb{Z} & \text{for all } i \in M,\ j \in J.
\end{array}
\]

Under the integrality constraints, the conditions $\sum_{i \in M} x_{ij} = 1$ and
$x_{ij} \ge 0$ imply that $x_{ij} \in \{0,1\}$ for all $i, j$. With the interpretation that
$x_{ij} = 1$ if job $j$ is assigned to machine $i$, and $x_{ij} = 0$ otherwise, the first
n equations stipulate that each job is assigned to exactly one machine. The
next m inequalities make sure that no machine needs more time than $t$ to
finish all jobs assigned to it. Minimizing $t$ leads to equality for at least one of
the machines, so the best $t$ is indeed the makespan of an optimal schedule.
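For very small instances the integer program can be checked by brute force, since every 0/1 solution is just a map from jobs to machines. The following Python sketch enumerates all $m^n$ such maps. The running-time matrix `d` is hypothetical data invented for illustration (it is not the copy-shop table), and the constant `INFEASIBLE` plays the role of the large running time $K$.

```python
from itertools import product

INFEASIBLE = 10**6  # stands in for the large constant K (an assumption)

def optimal_makespan(d):
    """Try all assignments; d[i][j] is the running time of job j on machine i."""
    m, n = len(d), len(d[0])
    best_t, best_assignment = None, None
    for assignment in product(range(m), repeat=n):  # assignment[j] = machine of job j
        loads = [0] * m
        for j, i in enumerate(assignment):
            loads[i] += d[i][j]
        t = max(loads)  # smallest t satisfying all m machine inequalities
        if best_t is None or t < best_t:
            best_t, best_assignment = t, assignment
    return best_t, best_assignment

# Hypothetical instance: 2 machines, 4 jobs, times in minutes.
d = [
    [3, 5, 2, INFEASIBLE],
    [4, 2, 6, 7],
]
t_opt, schedule = optimal_makespan(d)
print(t_opt, schedule)
```

The enumeration takes exponential time in n, which is exactly why the rest of the section settles for an approximation.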

As we have already seen in Section 3.3 (the minimum vertex cover prob- 
lem), solving the LP relaxation obtained from an integer program by deleting 
the integrality constraints can be a very useful step toward an approximate 
solution of the original problem. In our case, this approach needs an ad- 
ditional twist: We will relax another integer program, obtained by adding 
redundant constraints to the program above. After dropping integrality, the 
added constraints are no longer redundant and lead to a better LP relaxation. 

Let $t_{\mathrm{opt}}$ be the makespan of an optimal schedule. If $d_{ij} > t_{\mathrm{opt}}$, then we
know that job $j$ cannot run on machine $i$ in any optimal schedule, so we
may add the constraint $x_{ij} = 0$ to the integer program without affecting its
validity. More generally, if $T$ is any upper bound on $t_{\mathrm{opt}}$, we can write down
the following integer program, which has the same optimal solutions as the
original one:

\[
\begin{array}{lll}
\text{Minimize} & t \\
\text{subject to} & \sum_{i \in M} x_{ij} = 1 & \text{for all } j \in J \\
 & \sum_{j \in J} d_{ij} x_{ij} \le t & \text{for all } i \in M \\
 & x_{ij} \ge 0 & \text{for all } i \in M,\ j \in J \\
 & x_{ij} = 0 & \text{for all } i \in M,\ j \in J \text{ with } d_{ij} > T \\
 & x_{ij} \in \mathbb{Z} & \text{for all } i \in M,\ j \in J.
\end{array}
\]

But we do not know $t_{\mathrm{opt}}$, so what is the value of $T$ we use for the relaxation?
There is no need to specify this right here; for the time being, you can
imagine that we set $T = \max_{ij} d_{ij}$, a value that will definitely work, because
it makes our second integer program coincide with the first one.

A good schedule from the LP relaxation. As already indicated, the 
first step is to solve the LP relaxation, denoted by LPR(T), of our second 
integer program: 




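\[
\begin{array}{lll}
\text{Minimize} & t \\
\text{subject to} & \sum_{i \in M} x_{ij} = 1 & \text{for all } j \in J \\
 & \sum_{j \in J} d_{ij} x_{ij} \le t & \text{for all } i \in M \\
 & x_{ij} \ge 0 & \text{for all } i \in M,\ j \in J \\
 & x_{ij} = 0 & \text{for all } i \in M,\ j \in J \text{ with } d_{ij} > T.
\end{array}
\]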


In contrast to the vertex cover application, we cannot work with any optimal 
solution of the relaxation, though: We need a basic feasible optimal solution; 
see Section 4.2. To be more precise, we rely on the following property of a 
basic feasible solution; see Theorem 4.2.3. 


8.3.1 Assumption. The columns of the constraint matrix A corresponding 
to the nonzero variables $x^*_{ij}$ in the optimal solution of LPR(T) are linearly
independent. 


As usual, the nonnegativity constraints $x_{ij} \ge 0$ do not show up in A. In
case the simplex method is used to solve the relaxation, such a solution comes 
for free (the column of A corresponding to t could then actually be added 
to the set of columns in the assumption, but this is not needed). Otherwise, 
we can easily construct a solution satisfying the assumption from any given 
optimal solution, according to the recipe in the proof of Theorem 4.2.3. 


At this point the reader may wonder why we would want to use an 
algorithm different from the simplex method here, in particular when 
we are searching for a basic feasible optimal solution. The reason is of 
theoretical nature: We want to prove that a schedule whose makespan 
is at most twice the optimum can be found in polynomial time. As we 
have pointed out in the introductory part of Chapter 7, the simplex 
method is not known to run in polynomial time for any pivot rule, 
and for most pivot rules it simply does not run in polynomial time. 
For the theoretical result we want, we had therefore better use one 
of the polynomial-time methods for solving linear programs, sketched 
in Chapter 7. For practical purposes, the simplex method will do, of 
course. 

In general terms, what we are trying to develop here is a poly- 
nomial-time approximation algorithm for an NP-hard problem. Since 
complexity theory indicates that we will not be able to solve the prob- 
lem exactly within reasonable time bounds, it is quite natural to ask 
for an approximate solution that can be obtained in polynomial time. 
The quality of an approximate solution is typically measured by the 
approximation factor, the ratio between the value of the approximate 
solution and the value of an optimal solution. In our approximation 
algorithm for the scheduling problem, this factor will be at most 2. 


Let us fix the values $t^*$ and $x^*_{ij}$ of the variables in some optimal solution
of the LP relaxation. We now consider the bipartite graph $G = (M \cup J, E)$,
with

\[ E = \{\{i,j\} \subseteq M \cup J : x^*_{ij} > 0\}. \]


For an arbitrary optimal solution, this graph could easily be a (boring) com- 
plete bipartite graph, but under Assumption 8.3.1 it becomes more interest- 
ing. 


8.3.2 Lemma. In any subgraph of G (obtained by deleting edges, or vertices 
with their incident edges), the number of edges is at most the number of 
vertices. 


The graph G for m = 4 machines and n = 6 jobs might look like this, for 
example: 


[Figure: the bipartite graph G on the machines M and the jobs J.]


Proof. Let A be the constraint matrix of the LP relaxation. It has one row
for each machine, one for each job, and one for each runtime $d_{ij}$ exceeding $T$.
The columns of A corresponding to the nonzero variables $x^*_{ij}$ (equivalently,
to the edges of G) are linearly independent by our assumption.

Now we consider any subgraph of G, with vertex set $M' \cup J' \subseteq M \cup J$ and
edge set $E' \subseteq E$. Let $A'$ be the submatrix obtained from A by restricting to
rows corresponding to $M' \cup J'$, and to columns corresponding to $E'$.


Claim. The columns of A’ are linearly independent. 


To see this, we first observe that the columns of A corresponding
to $E'$ are linearly independent, simply because $E' \subseteq E$. Any variable
$x_{ij}$, $\{i,j\} \in E'$, occurs in the inequality for machine $i \in M'$ and in
the equation for job $j \in J'$, but in no other equation, since $x^*_{ij} > 0$
implies $d_{ij} \le T$. This means that the columns of A corresponding
to $E'$ have zero entries in all rows except for those corresponding to
machines or jobs in $M' \cup J'$. Hence, these columns remain linearly
independent even when we restrict them to the rows corresponding
to $M' \cup J'$.


By the claim, we have $|E'| \le |M' \cup J'|$ (a matrix with linearly independent
columns has at least as many rows as columns), and this is the statement of the
lemma.


This structural result about G allows us to find a good schedule. 


8.3.3 Lemma. Let $T \ge 0$ be such that the LP relaxation LPR(T) is feasible,
and suppose that a feasible solution is given that satisfies Assumption 8.3.1
and has value $t = t^*$. Then we can efficiently construct a schedule of makespan
at most $t^* + T$.
Proof. We need to assign each job to some machine. We begin with the
jobs $j$ that have degree one in the graph G, and we assign each such $j$ to its
unique neighbor $i$. By the construction of G and the equation $\sum_{i \in M} x_{ij} = 1$,
we have $x^*_{ij} = 1$ in this case. If machine $i$ has been assigned a set $S_i$ of jobs
in this way, it can process these jobs in time

\[ \sum_{j \in S_i} d_{ij} = \sum_{j \in S_i} d_{ij} x^*_{ij} \le \sum_{j \in J} d_{ij} x^*_{ij} \le t^*. \]

So each machine can handle the jobs assigned by this partial schedule in
time $t^*$.

Next we remove all assigned jobs and their incident edges from G. This
leaves us with a subgraph $G' = (M \cup J', E')$. In $G'$, all vertices of degree one
are machines. In the example depicted above, two jobs have degree one, and
their deletion results in the following subgraph $G'$:

[Figure: the subgraph $G'$ on the machines M and the remaining jobs $J'$.]


We will show that we can find a matching in G’ that covers all the remain- 
ing jobs. If we assign the jobs according to this matching, every machine gets 
at most one additional job, and this job can be processed in time at most T 
by the construction of our second integer program. Therefore, the resulting 
full schedule has makespan at most t* + T. 

It remains to construct the matching. To this end, we use Hall's theorem
from Section 8.2. According to this theorem, a matching exists if for every
subset $J'' \subseteq J'$ of jobs, its neighborhood (the set of all machines connected
to at least one job in $J''$) has size at least $|J''|$.

To check this condition, we let $J'' \subseteq J'$ be such a subset of jobs and $N(J'')$
its neighborhood. If $e$ is the number of edges in the subgraph of $G'$ induced
by $J'' \cup N(J'')$, then Lemma 8.3.2 guarantees that
$e \le |J'' \cup N(J'')| = |J''| + |N(J'')|$. On the
other hand, since every job has at least two neighbors, we have $e \ge 2|J''|$,
and this shows that $|N(J'')| \ge |J''|$.

Although this proof is nonconstructive, we can easily find the matching 
(once we know that it exists) by linear programming as in Section 8.2, or by 
other known polynomial-time methods. 
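The rounding step of this proof can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the fractional solution `x` is hypothetical, hand-made data whose support graph satisfies Lemma 8.3.2, and the matching is found by a standard augmenting-path search (one of the "other known polynomial-time methods"), not by linear programming. The names `round_fractional_schedule`, `machines`, and `jobs` are ours.

```python
def round_fractional_schedule(x, machines, jobs):
    """Round a fractional schedule x[(i, j)] to an assignment job -> machine.

    Assumes the support graph of x satisfies Lemma 8.3.2 (for instance,
    x is a basic feasible solution), so that the matching step succeeds.
    """
    neighbors = {j: [i for i in machines if x.get((i, j), 0) > 0] for j in jobs}
    assignment = {}
    # Step 1: a job of degree one goes to its unique neighbor.
    for j in jobs:
        if len(neighbors[j]) == 1:
            assignment[j] = neighbors[j][0]
    # Step 2: match the remaining jobs to pairwise distinct machines.
    remaining = [j for j in jobs if j not in assignment]
    matched_job = {}  # machine -> job matched to it in step 2

    def try_match(j, visited):
        for i in neighbors[j]:
            if i in visited:
                continue
            visited.add(i)
            # Machine i is free, or its current job can move elsewhere.
            if i not in matched_job or try_match(matched_job[i], visited):
                matched_job[i] = j
                return True
        return False

    for j in remaining:
        if not try_match(j, set()):
            raise ValueError("no matching covers the remaining jobs")
    for i, j in matched_job.items():
        assignment[j] = i
    return assignment

# Hypothetical fractional solution: jobs "a", "b" are integral, job "c" is split.
x = {(0, "a"): 1.0, (1, "b"): 1.0, (0, "c"): 0.5, (1, "c"): 0.5}
assignment = round_fractional_schedule(x, machines=[0, 1], jobs=["a", "b", "c"])
print(assignment)
```

Every machine receives at most one job in step 2, which is what yields the bound $t^* + T$ on the makespan.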


There is a direct way of constructing the matching in the proof of
Lemma 8.3.3 that relies neither on Hall's theorem nor on general
(bipartite) matching algorithms. It goes as follows: Lemma 8.3.2 implies
that each connected component of $G'$ is either a tree, or a tree with
one extra edge connecting two of its vertices. In the latter case, the
component has exactly one cycle of even length, because $G'$ is bipartite.
Therefore, we can match all jobs occurring on cycles, and after
removing the vertices of all cycles, we are left with a subgraph $G''$, all
of whose connected components are trees, with at most one vertex per
tree being a former neighbor of a cycle vertex. It follows that in every
tree of $G''$, at most one vertex can be a job of degree one, since all
other degree-one vertices already had degree one in $G'$ and are therefore
machines.

The matching of the remaining jobs is easy. We root any tree in
$G''$ at its unique job of degree one (or at any vertex if there is no such
job), and we match every job to one of its children in the rooted tree.
For this, we observe that there cannot be an isolated job in $G''$: Since
a job in $G''$ was on no cycle in $G'$, the removal of cycles can affect only
one of the at least two neighbors of the job in $G'$.


For our running example, here is a complete assignment of jobs to ma- 
chines obtained from the described procedure: 


[Figure: the resulting assignment of the jobs J to the machines M.]


Choosing the parameter T. How good is the bound we get from the previous
lemma? We will assume that $t^*$ is the value of an optimal basic feasible
solution of the LP relaxation with parameter $T$. Then $t^*$ is a lower bound
for the optimum makespan $t_{\mathrm{opt}}$, so this part of the bound looks promising.
But when we recall that $T$ must be an upper bound for $t_{\mathrm{opt}}$ in order for our
second integer program to be valid, it seems that we would have to choose
$T = t_{\mathrm{opt}}$ to get makespan at most $2t_{\mathrm{opt}}$. But this cannot be done, since we
have argued above that it is hard to compute $t_{\mathrm{opt}}$ (and if it were easy, there
would be no need for an approximate solution anyway).

Luckily, there is a way out. Reading Lemma 8.3.3 carefully, we see that 
T only needs to be chosen so that the LP relaxation LPR(T) is feasible, and 
there is no need for the second integer program to be feasible. 

If LPR(T) is feasible, Lemma 8.3.3 allows us to construct a schedule with
makespan at most $t^* + T$, so the best $T$ is the one that minimizes $t^* + T$ subject
to LPR(T) being feasible. Since $t^*$ depends on $T$, we make this explicit now
by writing $t^* = t^*(T)$. If LPR(T) is infeasible, we set $t^*(T) = \infty$.

How to find the best T. We seek a point $T^*$ in which the function
$f(T) = t^*(T) + T$ attains a minimum.

First we observe that $t^*(T)$ is a step function as in the following picture:

[Figure: the graph of the step function $t^*(T)$.]

Indeed, let us start with the value $T = \max_{ij} d_{ij}$, and let us decrease $T$
continuously. The value of $t^*(T)$ may change (jump up) only immediately
after moments when a new constraint of the form $x_{ij} = 0$ appears in LPR(T),
and this happens only for $T = d_{ij}$. Between these values the function $t^*(T)$
stays constant.

Consequently, the function $f(T) = t^*(T) + T$ is linearly increasing on
each interval between two consecutive $d_{ij}$'s, and the minimum is attained at
some $d_{ij}$. So we can compute the minimum, and the desired best value $T^*$,
by solving at most $mn$ linear programs of the form LPR(T), with $T$ ranging
over all $d_{ij}$. Under our convention that $t^*(T) = \infty$ if LPR(T) is infeasible,
the minimum will be attained at a value $T^*$ with LPR($T^*$) feasible.


8.3.4 Theorem. Let $T^*$ be the value of $T$ that minimizes $t^*(T) + T$. With
$T = T^*$, the algorithm in the proof of Lemma 8.3.3 computes a schedule of
makespan at most $2t_{\mathrm{opt}}$.


Proof. We know that for $T = t_{\mathrm{opt}}$, the second integer program is feasible and
has optimum value $t_{\mathrm{opt}}$. Hence LPR($t_{\mathrm{opt}}$) is feasible as well and its optimum
value can be only smaller: $t^*(t_{\mathrm{opt}}) \le t_{\mathrm{opt}}$. We thus have

\[
t^*(T^*) + T^* = \min_{T} \bigl(t^*(T) + T\bigr)
\le t^*(t_{\mathrm{opt}}) + t_{\mathrm{opt}}
\le 2t_{\mathrm{opt}}.
\]


The 2-approximation algorithm for the scheduling problem is 
adapted from the paper 


J. K. Lenstra, D. B. Shmoys, E. Tardos: Approximation algorithms
for scheduling unrelated parallel machines, Mathematical
Programming 46 (1990), pages 259-271.

The paper also proves that it is NP-hard to approximate the optimum
makespan with a factor less than 3/2.

Aiming at simplicity, we have presented a somewhat inefficient
version of the algorithm. A considerably faster version, with the same
approximation guarantee, can be obtained if we do not minimize the
function $T \mapsto t^*(T) + T$, but instead we only look for the smallest $T$
with $t^*(T) \le T$. Such a value of $T$ still guarantees $t^*(T) + T \le 2t_{\mathrm{opt}}$,
but it can be found by binary search over the sorted sequence of the $d_{ij}$.
In this way, it suffices to solve LPR(T) for $O(\log mn)$ values of $T$,
instead of $mn$ values as in our presentation. See the paper quoted
above for details.
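The binary search mentioned in this remark can be sketched as follows. This is a minimal illustration, not the algorithm of the cited paper in full: the function `t_star` is a hypothetical oracle standing for "solve LPR(T) and return its optimum value, or infinity if infeasible", and the step function `toy_t_star` below is made-up data. The search relies only on the fact that $t^*(T)$ does not increase when $T$ grows, so the condition $t^*(T) \le T$ switches from false to true exactly once.

```python
def smallest_good_T(candidates, t_star):
    """Return the smallest T among candidates with t_star(T) <= T, or None.

    t_star must be nonincreasing in T (float('inf') for infeasible T).
    """
    values = sorted(set(candidates))
    lo, hi = 0, len(values)  # the answer index, if any, lies in [lo, hi)
    while lo < hi:
        mid = (lo + hi) // 2
        if t_star(values[mid]) <= values[mid]:
            hi = mid        # condition holds: look for a smaller T
        else:
            lo = mid + 1    # condition fails: T must be larger
    return values[lo] if lo < len(values) else None

def toy_t_star(T):
    """Hypothetical nonincreasing step function standing in for the LP optimum."""
    if T < 3:
        return float("inf")
    if T < 5:
        return 8
    if T < 7:
        return 6
    return 4

best_T = smallest_good_T([1, 3, 4, 5, 6, 7, 9], toy_t_star)
print(best_T)
```

With k candidate values, the oracle is called about $\log_2 k$ times instead of k times.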


8.4 Upper Bounds for Codes 


Error-correcting codes. Let us consider a DVD player that has a remote 
control unit with 16 keys. Whenever one of the keys is pressed, the unit needs 
to communicate this event to the player. 

A natural option would be to send a 4-bit sequence: Since there are $2^4 = 16$
different 4-bit sequences (referred to as “words” in this context), a 4-bit
sequence is an economical way of communicating one of 16 possibilities. 

However, let us suppose that the transmission of bits from the remote 
control to the player is not quite reliable, and that each of the transmitted 
bits can be received incorrectly with some small probability, say 0.005. Then 
we expect that about 2% of the commands are received incorrectly, which 
can be regarded as a rather serious flaw of the device. 

One possibility of improvement is to triple each of the four transmitted 
bits. That is, instead of a 4-bit word abcd the unit sends the 12-bit word 
aaabbbcccddd. Now a transmission error in a single bit can be recognized 
and corrected. For example, if the sequence 111001000111 is received, and 
if we assume that at most one bit was received erroneously, it is clear that 
111000000111 must have been sent. Thus the original 4-bit sequence was 
1001. Of course, it might be that actually two or more bits are wrong, and 
then the original sequence is not reconstructed correctly, but this has much 
lower probability. Namely, if we assume that the errors in different bits are 
independent and occur with probability 0.005 (which may or may not be 
realistic, depending on the technical specifications), then the probability of 
two or more errors in a 12-bit sequence is approximately 0.16%. This is a 
significant improvement in reliability. However, the price to pay is transmit- 
ting three times as many bits, which presumably exhausts the battery of the 
remote control much faster. 
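The two probabilities quoted above are easy to recompute. The following Python sketch assumes, as in the text, independent bit errors with probability 0.005; the helper name `prob_at_least` is ours.

```python
from math import comb

def prob_at_least(k, n, p):
    """Probability of at least k errors among n independent bits,
    each received incorrectly with probability p."""
    return sum(comb(n, e) * p**e * (1 - p)**(n - e) for e in range(k, n + 1))

p = 0.005
p_plain = prob_at_least(1, 4, p)     # some error in a plain 4-bit word
p_tripled = prob_at_least(2, 12, p)  # two or more errors in a 12-bit word
print(f"{p_plain:.4%} {p_tripled:.4%}")
```

The first value is close to 2% and the second close to 0.16%, in agreement with the figures in the text.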

A significantly better solution to this problem was discovered by Richard 
Hamming in the 1950s (obviously not in the context of DVD players). In 
order to distinguish 16 possibilities, we send one of the following 7-bit words:
0000000, 0001011, 0010101, 0011110, 0100110, 0101101, 0110011, 0111000, 
1000111, 1001100, 1010010, 1011001, 1100001, 1101010, 1110100, 1111111. It 
can be checked that every two of these words differ in at least 3 bits. (This 
fact can be checked by brute force, but it is also a consequence of an elegant 
general theory, which we do not treat here.) Therefore, if one error occurs in 
the transmission, the sequence that was sent can be reconstructed uniquely. 
Hence the capability of correcting any single-bit error is retained, but the 
number of transmitted bits is reduced to slightly more than half compared 
to the previous approach. 
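The brute-force check mentioned above takes only a few lines. In this sketch the 16 words are copied from the list in the text; the helper name `differing_bits` is ours.

```python
from itertools import combinations

HAMMING_WORDS = """
0000000 0001011 0010101 0011110 0100110 0101101 0110011 0111000
1000111 1001100 1010010 1011001 1100001 1101010 1110100 1111111
""".split()

def differing_bits(a, b):
    """Number of positions in which the 0/1 strings a and b differ."""
    return sum(x != y for x, y in zip(a, b))

# Smallest number of differing bits over all 120 unordered pairs of words.
min_distance = min(differing_bits(a, b) for a, b in combinations(HAMMING_WORDS, 2))
print(min_distance)
```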

A similar problem can be investigated for other settings of the parameters 
as well. In general, we want to communicate one of N possibilities, we do it 
by transmitting an n-bit word, and we want that any at most r errors can 
be corrected. 

This problem has an enormous theoretical and practical significance. Our 
example with a DVD player was simple-minded, but error-correcting codes 
play a role in any technology involving transmission or storage of information, 
from computer disks and cell phones to deep-space probes. We now introduce 
some common terminology related to error-correcting codes. 


Terminology. The Hamming distance of two words $w, w' \in \{0,1\}^n$ is the
number of bits in which $w$ differs from $w'$:

\[ d_H(w,w') := |\{j \in \{1,\dots,n\} : w_j \ne w'_j\}|. \]

The Hamming distance can be interpreted as the number of errors “necessary”
to transform $w$ into $w'$. The weight of $w \in \{0,1\}^n$ is the number of
1's in $w$:

\[ |w| := |\{j \in \{1,\dots,n\} : w_j = 1\}|. \]

Finally, for $w, w' \in \{0,1\}^n$, we define their sum modulo 2 as the word

\[ w \oplus w' = \bigl((w_1 + w'_1) \bmod 2, \dots, (w_n + w'_n) \bmod 2\bigr) \in \{0,1\}^n. \]

These three notions are interrelated by the formula

\[ d_H(w,w') = |w \oplus w'|. \tag{8.6} \]


In the last of the solutions to the DVD-player problem discussed above, 
the crucial object was the set C = {0000000, 0001011, 0010101, 0011110, 
0100110, 0101101, 0110011, 0111000, 1000111, 1001100, 1010010, 1011001, 
1100001, 1101010, 1110100, 1111111} of 7-bit words in which every two dis- 
tinct words had Hamming distance at least 3. In coding theory, any subset 
$C \subseteq \{0,1\}^n$ is called a code. (This may sound strange, since under a code one
usually imagines some kind of method or procedure for coding, but in the 
theory of error-correcting codes one has to get used to this terminology.) For 
correcting errors, the crucial parameter is the distance of the code: 
8.4.1 Definition. A code $C \subseteq \{0,1\}^n$ has distance $d$ if $d_H(w,w') \ge d$
for any two distinct words $w, w'$ in $C$. For $n, d \ge 0$, let $A(n,d)$ denote the
maximum cardinality of a code $C \subseteq \{0,1\}^n$ with distance $d$.


We claim that a code C can correct any at most r errors if and only if it has 
distance at least 2r + 1. Indeed, on the one hand, if C contained two distinct 
words w’, w” that differ in at most 2r bits, we consider any word w resulting 
from w’ by flipping exactly half of the bits (rounded down) that distinguish 
w’ from w”. When the word w is received, there is no way to tell which of 
the words w’ and w” was intended. On the other hand, if any two distinct 
code words differ by at least $2r + 1$ bits, then for any word $w \in \{0,1\}^n$ there
is at most one code word from which w can be obtained through r or fewer 
errors, and this must be the word that was sent when w is received. 
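The second half of this argument is, in effect, a decoding rule: return the code word nearest to the received word. The following Python sketch applies the rule to the tripling code from the beginning of the section, which has distance 3 and hence corrects $r = 1$ error. The function names `decode` and `flip` are ours, and the exhaustive search over all code words is chosen for clarity, not efficiency.

```python
def decode(received, code):
    """Return the code word closest to `received` in Hamming distance."""
    return min(code, key=lambda c: sum(x != y for x, y in zip(c, received)))

def flip(word, position):
    """Flip one bit of a 0/1 string."""
    bit = "1" if word[position] == "0" else "0"
    return word[:position] + bit + word[position + 1:]

# The tripling code: abcd is sent as aaabbbcccddd.
TRIPLING_CODE = ["".join(b * 3 for b in format(k, "04b")) for k in range(16)]

decoded = decode("111001000111", TRIPLING_CODE)
print(decoded)
```

For the received word 111001000111 from the text, the nearest code word is 111000000111, which stands for the 4-bit sequence 1001.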

Given the number n of bits we can afford to transmit and the number 
r of errors we need to be able to correct, we want a code $C \subseteq \{0,1\}^n$ with
distance at least $2r + 1$ and with $|C|$ as large as possible, since the number
of words in the code corresponds to the amount of information that we can
transmit. Thus determining or estimating $A(n, d)$, the maximum possible size
of a code $C \subseteq \{0,1\}^n$ with distance $d$, is one of the main problems of coding
theory. 

The problem of finding the largest codes for given n and r can in principle
be solved by complete enumeration: We can go through all possible subsets of
$\{0,1\}^n$ and output the largest one that gives a code with distance $d = 2r + 1$. However,
this method becomes practically infeasible already for very small n. It turns 
out that the problem, for arbitrary n and d, is computationally difficult (NP- 
hard). Starting from very moderate values of n and d, the maximum code 
sizes are not exactly known, except for few lucky cases. Tightening the known 
upper and lower bounds on maximum sizes of error-correcting codes is the 
topic of ongoing research in coding theory. 
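To get a feeling for how quickly enumeration breaks down, one can try it for tiny parameters. The Python sketch below computes $A(n,d)$ by backtracking with a simple pruning rule; it is our own illustration (the function name `max_code_size` is hypothetical) and is practical only for very small n.

```python
from itertools import product

def max_code_size(n, d):
    """Compute A(n, d) by exhaustive backtracking over subsets of {0,1}^n."""
    words = list(product((0, 1), repeat=n))

    def dist(a, b):
        return sum(x != y for x, y in zip(a, b))

    best = 0

    def extend(size, candidates):
        # `candidates` are the words that may still be added to the current code.
        nonlocal best
        best = max(best, size)
        for k, w in enumerate(candidates):
            if size + len(candidates) - k <= best:
                return  # even taking all remaining candidates cannot beat best
            compatible = [u for u in candidates[k + 1:] if dist(u, w) >= d]
            extend(size + 1, compatible)

    extend(0, words)
    return best

print(max_code_size(4, 2), max_code_size(5, 3))
```

Already for n around 10 the search space is far too large for this naive approach.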

In this section we present a technique for proving upper bounds based 
on linear programming. When this technique was introduced by Philippe 
Delsarte in 1973, it provided upper bounds of unprecedented quality. 


Special cases. For all $n$, we have $A(n,1) = 2^n$, because any code has
distance 1. The case $d = 2$ is slightly more interesting. By choosing $C$ as the set
of all words of even weight, we see that $A(n,2) \ge 2^{n-1}$. But we actually have
$A(n,2) = 2^{n-1}$, since it is easy to show by induction that every code with
more than $2^{n-1}$ words contains two words of Hamming distance 1.

Given the simplicity of the cases $d = 1, 2$, it may come as a surprise that
already for $d = 3$, little is known. This is the setup for error-correcting codes
with one error allowed. Exact values of $A(n,3)$ have been determined only
for $n \le 16$; for $n = 17$, for example, the known bounds are

\[ 5312 \le A(17,3) \le 6552. \]




The sphere-packing bound. For any n and d, a simple upper bound 
on A(n,d) can be obtained by a volume argument. Let us motivate this 
with a real-life analogy. The local grocery is exhibiting a large glass box 
filled with peas, and the person to make the most accurate estimate
of the number of peas in the box wins a prize. Without any counting, 
you can conclude that the number of peas is bounded above by the 
volume of the box divided by the volume of a single pea (assuming 
that all the peas have the same volume). 

The same kind of argument can be used for the number $A(n, d)$,
where we may assume in our application that $d = 2r + 1$ is odd. Let
us fix any code $C$ of distance $d$. Now we think of the set $\{0,1\}^n$ as the
glass box, and of the $|C|$ Hamming balls

\[ B(w,r) := \{w' \in \{0,1\}^n : d_H(w,w') \le r\}, \quad w \in C, \]

as the peas. Since the code has distance $2r + 1$, all these Hamming balls
are disjoint and correspond in our analogy to peas. Consequently, their
number cannot be larger than the total number of words (the volume
of the box) divided by the number of words in a single Hamming ball
(the volume of a pea). The number of words at Hamming distance
exactly $i$ from $w$ is $\binom{n}{i}$. This implies

\[ |B(w,r)| = \sum_{i=0}^{r} \binom{n}{i}, \]

and the following upper bound on $A(n, 2r + 1)$ is obtained.

8.4.2 Lemma (Sphere-packing bound). For all $n$ and $r$,

\[ A(n, 2r+1) \le \left\lfloor \frac{2^n}{\sum_{i=0}^{r} \binom{n}{i}} \right\rfloor. \]


For example, the sphere-packing bound gives $A(7,3) \le 16$ (and so
the Hamming code in our initial example is optimal), and

\[ A(17,3) \le \lfloor 131072/18 \rfloor = 7281. \]
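The bound of Lemma 8.4.2 is straightforward to evaluate. A minimal Python sketch (the function name `sphere_packing_bound` is ours):

```python
from math import comb

def sphere_packing_bound(n, r):
    """Upper bound on A(n, 2r + 1): floor(2^n / |B(w, r)|)."""
    ball_size = sum(comb(n, i) for i in range(r + 1))  # words within distance r
    return 2**n // ball_size

print(sphere_packing_bound(7, 1), sphere_packing_bound(17, 1))
```

For $r = 1$ the ball has $n + 1$ words, which gives the two values quoted above.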


In the following theorem, which is the main result of this section, an upper 
bound on A(n,d) is expressed as an optimum of a certain linear program. 


8.4.3 Theorem (The Delsarte bound). For integers $n, i, t$ with
$0 \le i, t \le n$, let us put

\[ K_t(n,i) = \sum_{j=0}^{\min(i,t)} (-1)^j \binom{i}{j} \binom{n-i}{t-j}. \]

Then for every $n$ and $d$, $A(n,d)$ is bounded above by the optimum value of
the following linear program in variables $x_0, x_1, \dots, x_n$:

\[
\begin{array}{lll}
\text{Maximize} & x_0 + x_1 + \cdots + x_n \\
\text{subject to} & x_0 = 1 \\
 & x_i = 0, & i = 1, 2, \dots, d-1 \\
 & \sum_{i=0}^{n} K_t(n,i)\, x_i \ge 0, & t = 1, 2, \dots, n \\
 & x_0, x_1, \dots, x_n \ge 0.
\end{array}
\]
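The coefficients $K_t(n,i)$ of this linear program can be tabulated directly from the defining sum. The following Python sketch does just that; the function names are ours, and solving the linear program itself (which needs an LP solver) is deliberately left out.

```python
from math import comb

def krawtchouk(t, n, i):
    """K_t(n, i) = sum over j = 0..min(i, t) of (-1)^j C(i, j) C(n - i, t - j)."""
    return sum((-1)**j * comb(i, j) * comb(n - i, t - j)
               for j in range(min(i, t) + 1))

def delsarte_constraint_rows(n):
    """Coefficient rows (K_t(n, 0), ..., K_t(n, n)) for t = 1, ..., n."""
    return [[krawtchouk(t, n, i) for i in range(n + 1)] for t in range(1, n + 1)]

rows = delsarte_constraint_rows(3)
print(rows[0])
```

Two quick sanity checks follow from the definition: $K_t(n,0) = \binom{n}{t}$ and $K_1(n,i) = n - 2i$.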


Example. Using the sphere-packing bound, we have previously found
$A(17,3) \le 7281$. To compute the Delsarte bound, we solve the linear program
in the theorem (after eliminating $x_0, x_1, x_2$, which are actually constants, we
have 15 nonnegative variables and 17 constraints). The optimum value is
$6553\frac{1}{3}$, which implies $A(17,3) \le 6553$. The current best upper bound is 6552,
an improvement by only 1!

Toward an explanation. The proof of the Delsarte bound will proceed as
follows. With every subset $C \subseteq \{0,1\}^n$ we associate nonnegative real quantities
$\tilde{x}_0, \tilde{x}_1, \dots, \tilde{x}_n$ such that $|C| = \tilde{x}_0 + \cdots + \tilde{x}_n$. Then we will show that
whenever $C$ is a code with distance $d$, the $\tilde{x}_i$ constitute a feasible solution of
the linear program in the theorem. It follows that the maximum of the linear
program is at least as large as the size of any existing code $C$ with distance $d$
(but of course, it may be larger, since a feasible solution does not necessarily
correspond to a code).

Given $C \subseteq \{0,1\}^n$, the $\tilde{x}_i$ are defined by

\[ \tilde{x}_i = \frac{1}{|C|} \cdot \bigl|\{(w,w') \in C^2 : d_H(w,w') = i\}\bigr|, \quad i = 0, \dots, n. \]

Thus, $\tilde{x}_i$ is the number of ordered pairs of code words with Hamming
distance $i$, divided by the total number of code words. Since any of the $|C|^2$
ordered pairs contributes to exactly one of the $\tilde{x}_i$, we have

\[ \tilde{x}_0 + \tilde{x}_1 + \cdots + \tilde{x}_n = |C|. \]

Some of the constraints in the linear program in Theorem 8.4.3 are now
easy to understand. We clearly have $\tilde{x}_0 = 1$, since every $w \in C$ has distance 0
only to itself. The equations $\tilde{x}_1 = 0$ through $\tilde{x}_{d-1} = 0$ hold by the assumption
that $C$ has distance $d$; that is, there are no pairs of code words with Hamming
distance between 1 and $d - 1$. Interestingly, this is the only place in the proof
of the Delsarte bound where the assumption of $C$ being a code with distance $d$
is used.

The remaining set of constraints is considerably harder to derive, and it 
lacks a really intuitive explanation. Thus, to prove Theorem 8.4.3, we have 
to establish the following. 


8.4.4 Proposition. Let $C \subseteq \{0,1\}^n$ be an arbitrary set, let $\tilde{x}_i = \tilde{x}_i(C)$ be
defined as above, and let $t \in \{1, 2, \dots, n\}$. Then we have the inequality

\[ \sum_{i=0}^{n} K_t(n,i)\, \tilde{x}_i \ge 0. \]
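As a concrete check, one can compute the quantities $\tilde{x}_i$ for the 16-word Hamming code from the beginning of the section and evaluate the left-hand sides of these inequalities. The Python sketch below does this with exact rational arithmetic; the helper names are ours, and the code words are copied from the text.

```python
from fractions import Fraction
from math import comb

def krawtchouk(t, n, i):
    """K_t(n, i) as defined in Theorem 8.4.3."""
    return sum((-1)**j * comb(i, j) * comb(n - i, t - j)
               for j in range(min(i, t) + 1))

def distance_distribution(code):
    """x_i = (number of ordered pairs at Hamming distance i) / |C|, i = 0..n."""
    n = len(code[0])
    counts = [0] * (n + 1)
    for w in code:
        for w2 in code:
            counts[sum(a != b for a, b in zip(w, w2))] += 1
    return [Fraction(c, len(code)) for c in counts]

code = """
0000000 0001011 0010101 0011110 0100110 0101101 0110011 0111000
1000111 1001100 1010010 1011001 1100001 1101010 1110100 1111111
""".split()

x = distance_distribution(code)
lhs = [sum(krawtchouk(t, 7, i) * x[i] for i in range(8)) for t in range(1, 8)]
print([int(v) for v in x], [int(v) for v in lhs])
```

Here $\tilde{x}_1 = \tilde{x}_2 = 0$, as required for distance 3, and all seven left-hand sides are nonnegative.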


In the next lemma, $I \subseteq \{1, 2, \dots, n\}$ is a set of indices, and we write
$d_H^I(w,w')$ for the number of indices $i \in I$ with $w_i \ne w'_i$ (thus, the components
outside $I$ are ignored).


8.4.5 Lemma. Let $I \subseteq \{1, 2, \dots, n\}$ be a set of indices, and let $C \subseteq \{0,1\}^n$.
Then the number of pairs $(w,w') \in C^2$ with $d_H^I(w,w')$ even is at least as large
as the number of pairs $(w,w') \in C^2$ with $d_H^I(w,w')$ odd. (In probabilistic
terms, if we choose $w, w' \in C$ independently at random, then the probability
that they differ in an even number of positions from $I$ is at least as large as
the probability that they differ in an odd number of positions from $I$.)


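Proof. Let us write $|w|_I = |\{i \in I : w_i = 1\}|$, and let us set

\[ \mathcal{E} = \{w \in C : |w|_I \text{ is even}\}, \quad \mathcal{O} = \{w \in C : |w|_I \text{ is odd}\}. \]

From the equation $d_H^I(w,w') = |w \oplus w'|_I$, we see that if $d_H^I(w,w')$ is even,
then $|w|_I$ and $|w'|_I$ have the same parity, and so $w$ and $w'$ are both in $\mathcal{E}$
or both in $\mathcal{O}$. On the other hand, for $d_H^I(w,w')$ odd, one of $w, w'$ lies in
$\mathcal{E}$ and the other one in $\mathcal{O}$. So the assertion of the lemma is equivalent to
$|\mathcal{E}|^2 + |\mathcal{O}|^2 \ge 2 \cdot |\mathcal{E}| \cdot |\mathcal{O}|$, which follows by expanding $(|\mathcal{E}| - |\mathcal{O}|)^2 \ge 0$.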


8.4.6 Corollary. For every $C \subseteq \{0,1\}^n$ and every $v \in \{0,1\}^n$ we have

\[ \sum_{(w,w') \in C^2} (-1)^{(w \oplus w')^T v} \ge 0. \]


Proof. This is just another way of writing the statement of Lemma 8.4.5.
Indeed, if we set $I = \{i : v_i = 1\}$, then $(w \oplus w')^T v = d_H^I(w,w')$, and
hence the sum in the corollary is exactly the number of pairs $(w,w')$ with
$d_H^I(w,w')$ even minus the number of pairs with $d_H^I(w,w')$ odd.

The corollary also has a quick algebraic proof, which some readers may
prefer. It suffices to note that $(w \oplus w')^T v$ has the same parity as $(w + w')^T v$
(addition modulo 2 was replaced by ordinary addition of integers), and so

\[
\begin{aligned}
\sum_{(w,w') \in C^2} (-1)^{(w \oplus w')^T v}
 &= \sum_{(w,w') \in C^2} (-1)^{(w + w')^T v} \\
 &= \sum_{(w,w') \in C^2} (-1)^{w^T v} \cdot (-1)^{w'^T v} \\
 &= \Bigl( \sum_{w \in C} (-1)^{w^T v} \Bigr)^2 \ge 0.
\end{aligned}
\]
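The identity behind the algebraic proof can also be tested numerically. The following Python sketch compares the double sum with the squared single sum for an arbitrary subset $C$ and all $v$; the subset is drawn at random with a fixed seed and is purely test data, and all function names are ours.

```python
import random
from itertools import product

def dot(a, b):
    """Ordinary integer scalar product of two 0/1 vectors."""
    return sum(x * y for x, y in zip(a, b))

def xor(a, b):
    """Componentwise sum modulo 2."""
    return tuple((x + y) % 2 for x, y in zip(a, b))

def pair_sum(code, v):
    """Sum over ordered pairs (w, w') of (-1)^((w xor w')^T v)."""
    return sum((-1) ** dot(xor(w, w2), v) for w in code for w2 in code)

def squared_sum(code, v):
    """(Sum over w of (-1)^(w^T v)) squared, as in the algebraic proof."""
    return sum((-1) ** dot(w, v) for w in code) ** 2

rng = random.Random(0)  # fixed seed: the subset below is arbitrary test data
n = 5
all_words = list(product((0, 1), repeat=n))
code = rng.sample(all_words, 11)
checks = [(pair_sum(code, v), squared_sum(code, v)) for v in all_words]
print(all(a == b and a >= 0 for a, b in checks))
```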
Proof of Proposition 8.4.4. To prove the $t$th inequality in the proposition,
i.e., $\sum_{i=0}^{n} K_t(n,i)\, \tilde{x}_i \ge 0$, we sum the inequality in Corollary 8.4.6 over all
$v \in \{0,1\}^n$ of weight $t$. Interchanging the summation order, we obtain

\[ 0 \le \sum_{(w,w') \in C^2} \; \sum_{v \in \{0,1\}^n :\, |v| = t} (-1)^{(w \oplus w')^T v}. \]

To understand this last expression, let us fix $u = w \oplus w'$ and write
$S(u) = \sum_{v \in \{0,1\}^n :\, |v| = t} (-1)^{u^T v}$. In this sum, the $v$ with $u^T v = j$ are counted with
sign $(-1)^j$. How many $v$ of weight $t$ and with $u^T v = j$ are there? Let
$i = |u| = d_H(w,w')$ be the number of 1's in $u$. In order to form such a $v$, we
need to put $j$ ones in positions where $u$ has 1's and $t - j$ ones in positions
where $u$ has 0's. Hence the number of these $v$ is $\binom{i}{j} \binom{n-i}{t-j}$, and

\[ S(u) = \sum_{j=0}^{\min(i,t)} (-1)^j \binom{i}{j} \binom{n-i}{t-j}, \]

which we recognize as $K_t(n,i)$. So we have

\[ 0 \le \sum_{(w,w') \in C^2} K_t(n, d_H(w,w')), \]

and it remains to note that the number of times $K_t(n,i)$ appears in this
sum is $|C| \cdot \tilde{x}_i$. This finishes the proof of Proposition 8.4.4 and thus also of
Theorem 8.4.3.
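The counting step in this proof, namely that $S(u)$ depends only on the weight of $u$ and equals $K_t(n, |u|)$, can be confirmed by enumeration for small n. A minimal Python sketch (function names ours):

```python
from itertools import combinations
from math import comb

def krawtchouk(t, n, i):
    """K_t(n, i) as defined in Theorem 8.4.3."""
    return sum((-1)**j * comb(i, j) * comb(n - i, t - j)
               for j in range(min(i, t) + 1))

def S(u, t):
    """Sum of (-1)^(u^T v) over all 0/1 words v of weight t, by enumeration."""
    total = 0
    for support in combinations(range(len(u)), t):  # positions of the 1's in v
        total += (-1) ** sum(u[p] for p in support)
    return total

n = 6
agree = all(S((1,) * i + (0,) * (n - i), t) == krawtchouk(t, n, i)
            for i in range(n + 1) for t in range(1, n + 1))
print(agree)
```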


A small strengthening of the Delsarte bound. We have seen
that Theorem 8.4.3 yields $A(17,3) \le 6553$. We show how the inequalities
in the theorem can be slightly strengthened using a parity argument,
which leads to the best known upper bound $A(17,3) \le 6552$.
Similar tricks can improve the Delsarte bound in some other cases as
well, but the improvements are usually minor.

For contradiction let us suppose that $n = 17$ and there is a code
$C \subseteq \{0,1\}^n$ of distance 3 with $|C| = 6553$. The size of $C$ is odd, and we
note that for every code of odd size and every $v$, the last inequality in
the proof of Corollary 8.4.6 can be strengthened to

\[ \Bigl( \sum_{w \in C} (-1)^{w^T v} \Bigr)^2 \ge 1, \]

since an odd number of values from $\{-1, 1\}$ cannot sum to zero. If we
propagate this improvement through the proof of Proposition 8.4.4,
we arrive at the following inequality for the $\tilde{x}_i$:

\[ \sum_{i=0}^{n} K_t(n,i)\, \tilde{x}_i \ge \binom{n}{t} \Big/ |C|, \quad t = 1, 2, \dots, n. \]

Since in our particular case we suppose $|C| = 6553$, we can replace
the constraints $\sum_{i=0}^{n} K_t(n,i)\, x_i \ge 0$, $t = 1, 2, \dots, n$, in the linear
program in Theorem 8.4.3 by $\sum_{i=0}^{n} K_t(n,i)\, x_i \ge \binom{17}{t} / 6553$, and the
$\tilde{x}_i$ defined by our $C$ remain a feasible solution. However, the optimum
of this modified linear program is only $6552\frac{2}{3}$, which contradicts the
assumption $|C| = 6553$. This proves $A(17,3) \le 6552$.
The paper 


M. R. Best, A. E. Brouwer, F. J. MacWilliams, A. M. Odlyzko,
and N. J. A. Sloane: Bounds for binary codes of length less
than 25, IEEE Trans. Inform. Theory 24 (1978), pages 81-93.


describes this particular strengthening of the Delsarte bound and some 
similar approaches. A continually updated table of the best known 
bounds for A(n, d) for small n and d is maintained by Andries Brouwer 
at 


http://www.win.tue.nl/~aeb/codes/binary-1.html. 


The Delsarte bound explained. The result in Theorem 8.4.3 goes 
back to the thesis of Philippe Delsarte: 


P. Delsarte: An algebraic approach to the association schemes 
of coding theory, Philips Res. Repts. Suppl. 10 (1973). 


Here we sketch Delsarte’s original proof. At a comparable level of 
detail, his proof is more involved than the ad hoc proof above (from 
the paper by Best et al.). On the other hand, Delsarte’s proof is more 
systematic, and even more importantly, it can be extended to prove a 
stronger result, which we mention below. 

For $i \in \{0,\dots,n\}$, let $M_i$ be the $2^n \times 2^n$ matrix defined by¹

\[ (M_i)_{v,w} = \begin{cases} 1 & \text{if } d_H(v,w) = i, \\ 0 & \text{otherwise.} \end{cases} \]

The set of matrices of the form

\[ \sum_{i=0}^{n} y_i M_i, \qquad y_0,\dots,y_n \in \mathbb{R}, \]

is known to be closed under addition and scalar multiplication (this is
clear), and under matrix multiplication (this has to be shown). A set
of matrices closed under these operations is called a matrix algebra.
In our case, one speaks of the Bose-Mesner algebra of the Hamming
association scheme. The matrix multiplication turns out to
be commutative on this algebra, and this is known to imply a strong
condition: The $M_i$ have a common diagonalization, meaning that there
is an orthogonal matrix U with $U^TM_iU$ diagonal for all i.

¹ We assume that rows and columns are indexed by the words from $\{0,1\}^n$.

Once we know this, it is a matter of patience to find such a matrix U.
For example, the matrix U defined by

\[ U_{v,w} = \frac{1}{2^{n/2}}\,(-1)^{v^Tw} \]

will do. First we have to check (this is easy) that this matrix is indeed
orthogonal, meaning that $U^TU = I_{2^n}$.
For the entries of $U^TM_iU$, we can derive the formula

\[ (U^TM_iU)_{v,w} = \frac{1}{2^n} \sum_{\substack{(u,u')\in(\{0,1\}^n)^2 \\ d_H(u,u')=i}} (-1)^{u^Tv}\,(-1)^{u'^Tw}. \]

We claim that this sum evaluates to 0 whenever $v \ne w$; this will imply
that $U^TM_iU$ is indeed a diagonal matrix. To prove the claim, we let
j be any index for which $v_j \ne w_j$. In the sum, we can then pair up the
terms for $(u,u')$ and $(u\oplus e_j, u'\oplus e_j)$, with $e_j$ being the word with a
1 exactly at position j. This pairing covers all terms of the sum, and
paired-up terms are easily seen to cancel each other. (If you didn’t
believe $U^TU = I_{2^n}$, this can be shown with an even simpler pairing
argument along these lines.)

On the diagonal, we get

\[ (U^TM_iU)_{w,w} = \frac{1}{2^n} \sum_{\substack{(u,u')\in(\{0,1\}^n)^2 \\ d_H(u,u')=i}} (-1)^{(u\oplus u')^Tw} = \sum_{\substack{v\in\{0,1\}^n \\ |v|=i}} (-1)^{v^Tw} = K_i(n,|w|) \]

(for the last equality see the proof of Proposition 8.4.4) since any v of
weight i can be written in the form $v = u\oplus u'$ in $2^n$ different ways,
one for each $u \in \{0,1\}^n$.
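For small n, the two claims just derived can be checked by brute force. The following Python sketch is an illustration only. It assumes the explicit expression $K_i(n,t) = \sum_j (-1)^j \binom{t}{j}\binom{n-t}{i-j}$ for the numbers $K_i(n,t)$ (their definition is in Theorem 8.4.3, outside this passage), and it works with the integer quantity $2^n (U^TM_iU)_{v,w}$ to avoid the factor $2^{-n/2}$. All function names are ours.

```python
from itertools import product
from math import comb

def krawtchouk(i, n, t):
    """K_i(n, t) = sum_j (-1)^j C(t, j) C(n - t, i - j) (assumed definition)."""
    return sum((-1) ** j * comb(t, j) * comb(n - t, i - j)
               for j in range(min(i, t) + 1))

def scaled_entry(i, v, w, words):
    """2^n * (U^T M_i U)_{v,w}, computed directly from the double sum."""
    total = 0
    for u in words:
        for u2 in words:
            if sum(a != b for a, b in zip(u, u2)) == i:  # d_H(u, u') = i
                exponent = (sum(a * b for a, b in zip(u, v))
                            + sum(a * b for a, b in zip(u2, w)))
                total += (-1) ** exponent
    return total

def check_diagonalization(n):
    """Off-diagonal entries vanish; diagonal entries equal K_i(n, |w|)."""
    words = list(product((0, 1), repeat=n))
    for i in range(n + 1):
        for v in words:
            for w in words:
                expected = 2 ** n * krawtchouk(i, n, sum(w)) if v == w else 0
                if scaled_entry(i, v, w, words) != expected:
                    return False
    return True

print(check_diagonalization(3))
```

The check is exponential in n and is meant only as a sanity test of the formulas, not as a method of computation.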

Next, let us fix a code C and look at a specific matrix in the Bose-Mesner
algebra. For this, we define the values

\[ \tilde y_i = \frac{|\{(w,w')\in C^2 : d_H(w,w') = i\}|}{2^n\binom{n}{i}}, \qquad i = 0,\dots,n. \]

We note that $\tilde y_i$ is the probability that a randomly chosen pair of
words with Hamming distance i is a pair of code words. Moreover,
$\tilde y_i$ is related to our earlier quantity $x_i$ via

\[ \tilde y_i = \frac{|C|\,x_i}{2^n\binom{n}{i}}. \tag{8.7} \]

Here comes Delsarte’s main insight.


8.4.7 Lemma. The matrix $M = \sum_{i=0}^{n} \tilde y_i M_i$ is positive semidefinite.


Proof. We first observe that

\[ M_{v,w} = \tilde y_i, \tag{8.8} \]

where $i = d_H(v,w)$. We will express M as a positive linear combination
of matrices that are obviously positive semidefinite. We start
with the matrix $X^C$ defined by

\[ X^C_{v,w} = \begin{cases} 1 & \text{if } (v,w)\in C^2, \\ 0 & \text{otherwise.} \end{cases} \]

This matrix is positive semidefinite, since it can be written in the form

\[ X^C = x^C (x^C)^T, \]

where $x^C$ is the characteristic vector of C:

\[ x^C_w = \begin{cases} 1 & \text{if } w\in C, \\ 0 & \text{otherwise.} \end{cases} \]

Let $\Pi$ be the automorphism group of $\{0,1\}^n$, consisting of the
$n!\,2^n$ bijections that permute indices and swap 0’s with 1’s at selected
positions. With $\pi$ chosen uniformly at random from $\Pi$, we obtain²

\[ \mathrm{Prob}\bigl[(v,w)\in\pi(C)^2\bigr] = \frac{1}{|\Pi|}\sum_{\pi\in\Pi} \bigl[(v,w)\in\pi(C)^2\bigr] = \frac{1}{|\Pi|}\sum_{\pi\in\Pi} X^{\pi(C)}_{v,w}. \]

On the other hand,

\[ \mathrm{Prob}\bigl[(v,w)\in\pi(C)^2\bigr] = \mathrm{Prob}\bigl[(\pi^{-1}(v),\pi^{-1}(w))\in C^2\bigr] = \tilde y_i, \]

since $\pi^{-1}$ is easily shown to map (v,w) to a random pair of words
with Hamming distance i.

² The indicator variable [P] of a statement P has value 1 if P holds and 0 otherwise.

Using (8.8), this shows that

\[ M = \frac{1}{|\Pi|}\sum_{\pi\in\Pi} X^{\pi(C)} \]

is a positive linear combination of positive semidefinite matrices and
is therefore positive semidefinite itself. The lemma is proved.


After diagonalization of M by the matrix U, the statement of
Lemma 8.4.7 can equivalently be written as

\[ \sum_{i=0}^{n} \tilde y_i\,(U^TM_iU)_{w,w} = \sum_{i=0}^{n} \tilde y_i\,K_i(n,|w|) \ge 0, \qquad w \in \{0,1\}^n. \]

This is true since diagonalization preserves the property of being positive
semidefinite, which for diagonal matrices is equivalent to nonnegativity
of all diagonal entries.
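To see these inequalities at work, one can compute the n+1 distinct eigenvalues $\sum_i \tilde y_i K_i(n,t)$, t = 0,...,n, for a concrete small code. The sketch below does this exactly with rational arithmetic; the even-weight code of length 3 is our own choice of example, and the expression used for $K_i(n,t)$ is an assumption (the definition is in Theorem 8.4.3, outside this passage).

```python
from fractions import Fraction
from itertools import product
from math import comb

def krawtchouk(i, n, t):
    """K_i(n, t) = sum_j (-1)^j C(t, j) C(n - t, i - j) (assumed definition)."""
    return sum((-1) ** j * comb(t, j) * comb(n - t, i - j)
               for j in range(min(i, t) + 1))

def delsarte_eigenvalues(code, n):
    """Return [sum_i y_i K_i(n, t) for t = 0..n] for a code in {0,1}^n."""
    pairs = [0] * (n + 1)              # pairs[i] = number of ordered pairs at distance i
    for w in code:
        for w2 in code:
            pairs[sum(a != b for a, b in zip(w, w2))] += 1
    y = [Fraction(pairs[i], 2 ** n * comb(n, i)) for i in range(n + 1)]
    return [sum(y[i] * krawtchouk(i, n, t) for i in range(n + 1))
            for t in range(n + 1)]

# Example code (our choice): all words of even weight in {0,1}^3.
even_code = [w for w in product((0, 1), repeat=3) if sum(w) % 2 == 0]
print(delsarte_eigenvalues(even_code, 3))
```

For this code the four values are 2, 0, 0, 2, all nonnegative as the lemma requires.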

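Taking into account the relation (8.7) between the $\tilde y_i$ and our original
$x_i$, this implies the following inequalities for any code C:

\[ \sum_{i=0}^{n} \frac{K_i(n,t)}{\binom{n}{i}}\,x_i \ge 0, \qquad t = 1,\dots,n. \tag{8.9} \]

Observing that

\[ \frac{\binom{t}{j}\binom{n-t}{i-j}}{\binom{n}{i}} = \frac{\binom{i}{j}\binom{n-i}{t-j}}{\binom{n}{t}}, \]

we get (cf. the definition of $K_t$ in Theorem 8.4.3)

\[ \frac{K_i(n,t)}{\binom{n}{i}} = \frac{K_t(n,i)}{\binom{n}{t}}. \]

Under this equation, the inequalities in (8.9) are equivalent to those
in Proposition 8.4.4 and we recover the Delsarte bound.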
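The symmetry relation between $K_i(n,t)$ and $K_t(n,i)$ is easy to test numerically. The sketch below checks the cross-multiplied form $K_i(n,t)\binom{n}{t} = K_t(n,i)\binom{n}{i}$ in integer arithmetic. It assumes the expression $K_i(n,t) = \sum_j (-1)^j \binom{t}{j}\binom{n-t}{i-j}$, which matches the summands in the binomial identity above; the function names are ours.

```python
from math import comb

def krawtchouk(i, n, t):
    """K_i(n, t) = sum_j (-1)^j C(t, j) C(n - t, i - j) (assumed definition)."""
    return sum((-1) ** j * comb(t, j) * comb(n - t, i - j)
               for j in range(min(i, t) + 1))

def symmetry_holds(n):
    """Check K_i(n,t) * C(n,t) == K_t(n,i) * C(n,i) for all 0 <= i, t <= n."""
    return all(krawtchouk(i, n, t) * comb(n, t) == krawtchouk(t, n, i) * comb(n, i)
               for i in range(n + 1) for t in range(n + 1))

print(all(symmetry_holds(n) for n in range(1, 18)))
```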


Beyond the Delsarte bound. Alexander Schrijver generalized Delsarte’s
approach and improved the upper bounds on A(n,d) significantly
in many cases:


A. Schrijver: New code upper bounds from the Terwilliger algebra
and semidefinite programming, IEEE Trans. Inform.
Theory 51 (2005), pages 2859-2866.


His work uses semidefinite programming instead of linear programming.

8.5 Sparse Solutions of Linear Systems 


A coding problem. We begin with discussing error-correcting codes again, 
but this time we want to send a sequence $w \in \mathbb{R}^k$ of k real numbers. Or 
rather not we, but a deep-space probe which needs to transmit its priceless 
measurements represented by w back to Earth. We want to make sure that 
all components of w can be recovered correctly even if some fraction, say 8%, 
of the transmitted numbers are corrupted, due to random errors or even 
maliciously (imagine that the secret Brotherhood for Promoting the Only 
Truth can somehow tamper with the signal slightly in order to document the 
presence of supernatural phenomena in outer space). We admit gross errors; 
that is, if the number 3.1415 is sent and it gets corrupted, it can be received 
as 2152.66, or 3.1425, or —10!, or any other real number. 

Here is a way of encoding w: We choose a suitable number n > k and a
suitable n×k encoding matrix Q of rank k, and we send the vector $z = Qw \in \mathbb{R}^n$.
Because of the errors, the received vector is not z but $\tilde z = z + x$, where
$x \in \mathbb{R}^n$ is a vector with at most $r = \lfloor 0.08n \rfloor$ nonzero components. We ask,
under what conditions on Q can z be recovered from $\tilde z$?

Somewhat counterintuitively, we will concentrate on the task of finding
the “error vector” x. Indeed, once we know x, we can compute w by solving
the system of linear equations $Qw = z = \tilde z - x$. The solution, if one exists, is
unique, since we assume that Q has rank k and hence the mapping $w \mapsto Qw$
is injective.
Sparse solutions of underdetermined linear systems. In order to
compute x, we first reformulate the recovery problem. Let m = n−k and let A
be an m×n matrix such that AQ = 0. That is, considering the k-dimensional
linear subspace of $\mathbb{R}^n$ generated by the columns of Q, the rows of A form
a basis of its orthogonal complement. The following picture illustrates the
dimensions of the matrices:

[Figure: the n×k matrix Q drawn next to the n×m matrix $A^T$, with AQ = 0.]

In the recovery problem we have $\tilde z = Qw + x$. Multiplying both sides by A
from the left, we obtain $A\tilde z = AQw + Ax = Ax$. Setting $b = A\tilde z$, we thus get
that the unknown x has to satisfy the system of linear equations Ax = b.
We have m = n − k equations and n > m unknowns; the system is underdetermined
and it has infinitely many solutions. In general, not all of these
solutions can appear as an error vector in the decoding problem (we note that
the multiplication by A above is not necessarily an equivalent transformation
and so it may give rise to spurious solutions). However, we seek a solution x
with the extra property $|\mathrm{supp}(x)| \le r$, where we introduce the notation

\[ \mathrm{supp}(x) = \{i \in \{1,2,\dots,n\} : x_i \ne 0\}. \]


As we will see, under suitable conditions relating n, m, r and A, such a sparse
solution of Ax = b turns out to be unique (and thus it has to be the desired
error vector), and it can be computed efficiently by linear programming!
Let us summarize the resulting problem once again:


Sparse solution of underdetermined system of linear equations


Given an m×n matrix A with m < n, a vector $b \in \mathbb{R}^m$, and an integer r,
find an $x \in \mathbb{R}^n$ such that

\[ Ax = b \quad\text{and}\quad |\mathrm{supp}(x)| \le r \tag{8.10} \]

if one exists.
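The passage from the coding problem to a system of this form can be traced on a toy instance. In the sketch below, the matrices Q and A, the message w, and the error vector x are hypothetical values chosen by us (with k = 1, n = 3, m = 2); the point is only that the vector b computed by the receiver from the corrupted data equals Ax and does not depend on w.

```python
def mat_vec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

# Hypothetical toy instance with k = 1, n = 3, m = n - k = 2.
Q = [[1], [1], [1]]              # n x k encoding matrix of rank 1
A = [[1, -1, 0], [0, 1, -1]]     # m x n matrix; each row is orthogonal to the column of Q

w = [5]                          # message to be transmitted
x = [0, 7, 0]                    # error vector with a single nonzero component
z = mat_vec(Q, w)                # transmitted vector z = Qw
z_received = [a + e for a, e in zip(z, x)]

b = mat_vec(A, z_received)       # computable by the receiver
print(b, mat_vec(A, x))          # the two vectors coincide
```

Here the receiver sees (5, 12, 5) and obtains b = (−7, 7), which is exactly Ax; the decoding task is then to find the sparse x from A and b alone.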


The coding problem above is only one among several important practical 
problems leading to the computation of sparse solutions of underdetermined 
systems. We will mention other applications at the end of this section. From 
now on, we call any x satisfying (8.10) a sparse solution to Ax = b. (Warn- 
ing: This shouldn’t be confused with solutions of sparse systems of equations, 
which is an even more popular topic in numerical mathematics and scientific 
computing.) 

A linear algebra view. There is a simple necessary and sufficient condition 
guaranteeing that there is at most one sparse solution of Ax = b. 


8.5.1 Observation. With n, m, r fixed, the following two conditions on an
m×n matrix A are equivalent:


(i) The system Ax = b has at most one sparse solution x for every b. 
(ii) Every 2r or fewer columns of A are linearly independent. 


Proof. To prove the (more interesting) implication (ii)⇒(i), let us assume
that x′ and x″ are two different sparse solutions of Ax = b. Then y = x′ − x″ ≠ 0
has at most 2r nonzero components and satisfies Ay = Ax′ − Ax″ = 0,
and hence it defines a linear dependence of at most 2r columns of A.

To prove (i)⇒(ii), we essentially reverse the above argument. Supposing
that there exists nonzero $y \in \mathbb{R}^n$ with Ay = 0 and |supp(y)| ≤ 2r, we write
y = x′ − x″, where both x′ and x″ have at most r nonzero components. For
example, x′ may agree with y in the first ⌊|supp(y)|/2⌋ nonzero components
and have 0’s elsewhere, and x″ = x′ − y has the remaining at most r nonzero
components of y with opposite sign. We set b = Ax′, so that x′ is a sparse
solution of Ax = b, and we note that x″ is another sparse solution since
Ax″ = Ax′ − Ay = Ax′ = b.
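For a small matrix, condition (ii) can be tested directly by examining all sets of 2r columns. The following sketch does so with exact rational Gaussian elimination; it is a brute-force illustration with running time exponential in r, and the function names and example matrices are ours.

```python
from fractions import Fraction
from itertools import combinations

def rank(rows):
    """Rank of a matrix (list of rows), by exact Gaussian elimination."""
    m = [[Fraction(a) for a in row] for row in rows]
    rk = 0
    for col in range(len(m[0]) if m else 0):
        pivot = next((r for r in range(rk, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue                      # no pivot in this column
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(rk + 1, len(m)):
            factor = m[r][col] / m[rk][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk

def every_2r_columns_independent(A, r):
    """Condition (ii): every set of 2r (or fewer) columns of A is linearly independent.

    It suffices to test the sets of size exactly min(2r, n), since subsets of
    linearly independent sets are linearly independent.
    """
    n = len(A[0])
    size = min(2 * r, n)
    for cols in combinations(range(n), size):
        sub = [[row[c] for c in cols] for row in A]
        if rank(sub) < size:
            return False
    return True

print(every_2r_columns_independent([[1, -1, 0], [0, 1, -1]], 1))  # every 2 columns independent
print(every_2r_columns_independent([[1, 2, 0], [0, 0, 1]], 1))    # first two columns are parallel
```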


Let us note that (ii) implies that, in particular, m ≥ 2r. On the other
hand, if we choose a “random” 2r×n matrix A, we almost surely have every
2r columns linearly independent.³ So in the coding problem, if we set n so
that n = k + 2r, choose A randomly, and let the columns of Q form a basis
of the orthogonal complement of the row space of A, we seem to be done: a
random A has almost surely every 2r columns linearly independent, and in
such case, assuming that no more than r errors occurred, the sparse error
vector x is always determined uniquely, and so is the original message w.


Efficiency? But a major question remains—how can we find the unknown 
sparse solution x? Unfortunately, it turns out that the problem of computing 
a sparse solution of Ax = b is difficult (NP-hard) in general, even for A 
satisfying the conditions of Observation 8.5.1. 

Since the problem of finding a sparse solution of Ax = b is important 
and computationally difficult, several heuristic methods have been proposed 
for solving it at least approximately and at least in some cases. One of them, 
described next, turned out to be considerably more powerful than the others. 


Basis pursuit. A sparse solution x is “small” in the sense of having few
nonzero components. The idea is to look for x that is “small” in another
sense that is easier to deal with, namely, with small $|x_1| + |x_2| + \cdots + |x_n|$.
The last quantity is commonly denoted by $\|x\|_1$ and called the $\ell_1$-norm of x
(while $\|x\| = \|x\|_2 = \sqrt{x_1^2 + \cdots + x_n^2}$ is the usual Euclidean norm, which can
also be called the $\ell_2$-norm).⁴ We thus arrive at the following optimization
problem (usually called basis pursuit in the literature):


Minimize $\|x\|_1$ subject to $x \in \mathbb{R}^n$ and Ax = b. (BP)


³ In this book we don’t want to assume or introduce the knowledge required to
state and prove this claim rigorously. Instead, we offer the following semiformal
argument relying on a famous but nontrivial theorem. The condition of linear
independence of every 2r columns can be reformulated as $\det(A_I) \ne 0$ for every
2r-element $I \subseteq \{1,2,\dots,n\}$. Now for I fixed, $\det(A_I)$ is a polynomial of degree
2r in the 2rn entries of A (it really depends only on $4r^2$ entries but never mind),
and the set of the matrices A with $\det(A_I) = 0$ is the zero set of this polynomial in
$\mathbb{R}^{2rn}$. The zero set of any nonzero polynomial is very “thin”; by Sard’s theorem,
it has Lebesgue measure 0. Hence the matrices A with $\det(A_I) = 0$ for at least
one I correspond to points in $\mathbb{R}^{2rn}$ lying on the union of $\binom{n}{2r}$ zero sets, each of
measure 0, and altogether they have measure 0. Therefore, such matrices appear
with zero probability in any “reasonable” continuous distribution on matrices,
for example, if the entries of A are chosen independently and uniformly from the
interval [−1, 1].

⁴ The letter ℓ here can be traced back to the surname of Henri Lebesgue, the
founder of modern integration theory. A certain space of integrable functions on
[0,1] is denoted by $L_1(0,1)$ in his honor, and $\ell_1$ is a “pocket version” of this
space consisting of countable sequences instead of functions.

By a trick we have learned in Section 2.4, this problem can be reformulated
as a linear program:

\[
\begin{array}{ll}
\text{Minimize} & u_1 + u_2 + \cdots + u_n \\
\text{subject to} & Ax = b \\
 & -u \le x \le u \\
 & x, u \in \mathbb{R}^n,\ u \ge 0.
\end{array}
\tag{BP$'$}
\]

To check the equivalence of (BP) and (BP′), we just note that in an optimal
solution of (BP′) we have $u_i = |x_i|$ for every i.

The basis pursuit approach to finding a sparse solution of Ax = b thus 
consists in computing an optimal solution x* of (BP) by linear programming, 
and hoping that, with some luck, this x* might also be the sparse solution 
or at least close to it. 
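On a toy instance, the minimizer of the ℓ₁-norm can be found without a linear programming solver. The sketch below does not solve the linear program; it uses a shortcut that works only when the kernel of A is one-dimensional, so that the solutions of Ax = b form a line x0 + t·d and the ℓ₁-norm along it is a convex piecewise linear function of t, minimized at one of its breakpoints. The instance (A with rows (1, −1, 0) and (0, 1, −1), b = (−7, 7)) and the function names are our own choices.

```python
from fractions import Fraction

def l1_norm(x):
    return sum(abs(c) for c in x)

def basis_pursuit_on_line(x0, d):
    """Minimize ||x0 + t*d||_1 over real t, for integer vectors x0 and d != 0.

    The objective is convex and piecewise linear in t, so a minimum is
    attained at a breakpoint t = -x0[j] / d[j] with d[j] != 0.
    """
    breakpoints = {Fraction(-a, b) for a, b in zip(x0, d) if b != 0}
    candidates = [[a + t * b for a, b in zip(x0, d)] for t in breakpoints]
    return min(candidates, key=l1_norm)

# Toy system: x1 - x2 = -7, x2 - x3 = 7.
x0 = [7, 14, 7]   # some solution with all components nonzero
d = [1, 1, 1]     # spans the kernel of A
print(basis_pursuit_on_line(x0, d))
```

Starting from the dense solution (7, 14, 7), the minimization returns (0, 7, 0), which has a single nonzero component. For kernels of higher dimension one would solve the linear program itself with any LP solver.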

At first sight it is not clear why basis pursuit should have any chance of 
finding a sparse solution. After all, the desired sparse solution might have a 
few huge components, while x*, a minimizer of the $\ell_1$-norm, might have all 
components nonzero but tiny. 

Surprisingly, experiments have revealed that basis pursuit actually per- 
forms excellently, and it usually finds the sparse solution exactly even in con- 
ditions that don’t look very favorable. Later these findings were confirmed 
by rather precise theoretical results. Here we state the following particular 
case of such results: 


8.5.2 Theorem (Guaranteed success of basis pursuit). Let

\[ m = \lfloor 0.75n \rfloor, \]

and let A be a random m×n matrix, where each entry is drawn from the
standard normal distribution N(0,1) and the entries are mutually independent.⁵
Then with probability at least $1 - e^{-cm}$, where c > 0 is a positive
constant, the matrix A has the following property:


If $b \in \mathbb{R}^m$ is such that the system Ax = b has a solution $\tilde x$ with at
most $r = \lfloor 0.08n \rfloor$ nonzero components, then $\tilde x$ is a unique optimal
solution of (BP).


For brevity, we call a matrix A with the property as in the theorem BP-exact
(more precisely, we should say “BP-exact for r,” where r specifies the
maximum number of nonzero components). For a BP-exact matrix A we can
thus find a sparse solution of Ax = b exactly and efficiently, by solving the
linear program (BP′).

⁵ We recall that the distribution N(0,1) has density given by the Gaussian “bell
curve” $\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}$. How can we generate a random number with this distribution?
This is implemented in many software packages, and methods for doing it can
be found, for instance, in

D. Knuth: The Art of Computer Programming, Vol. 2: Seminumerical Algorithms,
Addison-Wesley, Reading, Massachusetts, 1973.

Returning to the coding problem from the beginning of the section, we 
immediately obtain the following statement: 


8.5.3 Corollary. Let k be a sufficiently large integer, let us set n = 4k,
m = 3k, let a random m×n matrix A be generated as in Theorem 8.5.2,
and let Q be an n×k matrix of rank k with AQ = 0 (in other words, the
column space of Q is the orthogonal complement of the row space of A).
Then the following holds with probability overwhelmingly close to 1: If Q is
used as a coding matrix to transmit a vector $w \in \mathbb{R}^k$, by sending the vector
$z = Qw \in \mathbb{R}^n$, then even if an arbitrary set of at most 8% of the entries of z
is corrupted, we can still reconstruct w exactly and efficiently, by solving the
appropriate instance of (BP′).


Drawing the elements of A from the standard normal distribution 
is not the only known way of generating a BP-exact matrix. Results 
similar to Theorem 8.5.2, perhaps with worse constants, can be proved 
by known techniques for random matrices with other distributions. 
Perhaps the simplest such distribution is obtained by choosing each
entry to be +1 or −1, each with probability 1/2 (and again with all
entries mutually independent).

A somewhat unpleasant feature of Theorem 8.5.2 and of similar 
results is that they provide a BP-exact matrix only with high proba- 
bility. No efficient method for verifying that a given matrix is BP-exact 
is known at present, and so we cannot be absolutely sure. In practice 
this is not really a problem, since the probability of failure (i.e., of 
generating a matrix that is not BP-exact) can be made very small by 
choosing the parameters appropriately, much smaller than the proba- 
bility of sudden death of all people in the team that wants to compute 
the sparse solution, for instance. Still, it would be nice to have explicit 
constructions of BP-exact matrices with good parameters. 


Random errors. In our coding problem, we allow for completely
arbitrary (even malicious) errors; all we need is that there aren’t too
many errors. However, in practice one may often assume that the errors
occur at random positions, and we want to be able to decode
correctly only with high probability, that is, for most (say 99.999%)
of the $\binom{n}{r}$ possible locations of the r errors. It turns out that considerably
stronger numerical bounds can be obtained in this setting: For
example, Theorem 8.5.2 tells us that for $m = \lfloor 0.75n \rfloor$ and k = n − m,
we are guaranteed to fix any 0.08n errors with high probability, but it
turns out that we can also fix most of the possible r-tuples of errors for
r as large as 0.36n! For a precise statement see the paper of Donoho
quoted near the end of the section.

Geometric meaning of BP-exactness. Known proofs of Theorem 8.5.2 or
similar results use a good deal of geometric probability and high-dimensional
geometry, knowledge which we want neither to assume nor to introduce in
this book. We thus have to omit a proof. Instead, we present an appealing
geometric characterization of BP-exact matrices, which is a starting point of
existing proofs.

For its statement we need to recall the crosspolytope, a convex polytope
already mentioned in Section 4.3. We will denote the n-dimensional crosspolytope
by $B_1^n$, which should suggest that it is the unit ball of the $\ell_1$-norm:

\[ B_1^n = \{x \in \mathbb{R}^n : \|x\|_1 \le 1\}. \]


8.5.4 Lemma (Reformulation of BP-exactness). Let A be an m×n matrix,
m < n, let r ≤ m, and let $L = \{x \in \mathbb{R}^n : Ax = 0\}$ be the kernel (null
space) of A. Then A is BP-exact for r if and only if the following holds: For
every $z \in \mathbb{R}^n$ with $\|z\|_1 = 1$ (i.e., z is a boundary point of the crosspolytope)
and $|\mathrm{supp}(z)| \le r$, the affine subspace L + z intersects the crosspolytope only
at z; that is, $(L + z) \cap B_1^n = \{z\}$.


Let us discuss an example with n = 3, m = 2, and r = 1, about the
only sensible setting for which one can draw a picture. If the considered
2×3 matrix A has full rank, which we may assume, then the kernel L is a
one-dimensional linear subspace of $\mathbb{R}^3$, that is, a line passing through the
origin. The points z coming into consideration have at most r = 1 nonzero
coordinate, and they lie on the boundary of the regular octahedron $B_1^3$, and
hence they are precisely the 6 vertices of $B_1^3$:


The line L through the origin is drawn thick, and the condition in the lemma
says that each of the 6 translates of L to the vertices should only touch the
crosspolytope. Another way of visualizing this is to look at the projection of
$B_1^3$ to the plane orthogonal to L. Each of the translates of L is projected to a
point, and the crosspolytope is projected to a convex polygon. The condition
then means that all the 6 vertices should appear on the boundary in the
projection, as in the left picture below,


while the right picture corresponds to a bad L (the condition is violated at
the two vertices marked by dots that lie inside the projection). In general,
of course, L is not a line but a k-dimensional linear subspace of $\mathbb{R}^n$, and the
considered points z are not only vertices of $B_1^n$, but they can lie in all
(r−1)-dimensional faces of $B_1^n$. Indeed, we note that the points z on the surface of
$B_1^n$ with at most r nonzero components are exactly the points of the union
of all (r−1)-dimensional faces, omitting the easy proof of this fact (but look
at least at the case n = 3, r = 2).


Proof of Lemma 8.5.4. First we assume that A is BP-exact, we consider
a point z with $\|z\|_1 = 1$ and $|\mathrm{supp}(z)| \le r$, and we set b = Az. Then the
system Ax = b has a sparse solution, namely z, and hence z has to be the
unique point among all solutions of Ax = b that minimize the $\ell_1$-norm.
Noting that the set of all solutions of Ax = b is exactly the affine subspace
L + z, we get that z is the only point in L + z with $\ell_1$-norm at most 1. That
is, $(L + z) \cap B_1^n = \{z\}$ as claimed.

Conversely, we assume that L satisfies the condition in the lemma and we
consider $b \in \mathbb{R}^m$. Let us suppose that the system Ax = b has a solution $\tilde x$
with at most r nonzero components. If $\tilde x = 0$, then b = 0, and clearly, 0 is
also the only optimum of (BP). For $\tilde x \ne 0$, we set $z = \tilde x/\|\tilde x\|_1$. Then $\|z\|_1 = 1$
and $|\mathrm{supp}(z)| \le r$, and so by the assumption, z is the only point in L + z of
$\ell_1$-norm at most 1. By rescaling we get that $\tilde x$ is the only point in $L + \tilde x$ of
$\ell_1$-norm at most $\|\tilde x\|_1$, and since $L + \tilde x$ is the set of all solutions of Ax = b,
we get that A is BP-exact.
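In the drawable case n = 3, m = 2, r = 1, the condition of Lemma 8.5.4 can be tested mechanically. The sketch below takes an integer vector d spanning the kernel L (a hypothetical input of ours) and checks, for each of the 6 vertices z of the octahedron, whether z is the only point of the line z + t·d with ℓ₁-norm at most 1. It relies on the fact that the norm along the line is a convex piecewise linear function of t, so t = 0 is its unique minimizer exactly when 0 is the only breakpoint attaining the minimum.

```python
from fractions import Fraction

def l1_norm(x):
    return sum(abs(c) for c in x)

def line_is_good_at(z, d):
    """True if z is the only point of {z + t*d} with l1-norm at most ||z||_1 (d != 0)."""
    def value(t):
        return l1_norm([a + t * b for a, b in zip(z, d)])
    breakpoints = {Fraction(-a, b) for a, b in zip(z, d) if b != 0}
    best = min(value(t) for t in breakpoints)
    return [t for t in breakpoints if value(t) == best] == [0]

def bp_exact_for_r1(d):
    """Check the condition of the lemma at the 6 vertices of the octahedron."""
    vertices = [[s if j == i else 0 for j in range(3)]
                for i in range(3) for s in (1, -1)]
    return all(line_is_good_at(z, d) for z in vertices)

print(bp_exact_for_r1([1, 1, 1]), bp_exact_for_r1([3, 1, 1]))
```

The line spanned by (1, 1, 1) passes the test, while the line spanned by (3, 1, 1) fails it: translated to the vertex (1, 0, 0), it enters the octahedron at the point (0, −1/3, −1/3).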


Intuition for BP-exactness. We don’t have means for proving 
Theorem 8.5.2, but now, using the lemma just proved, we can at least 
try to convey some intuition as to why a claim like Theorem 8.5.2 is 
plausible, and what kind of calculations are needed to prove it. 

The kernel L of a random matrix A defines a random k-dimensional
subspace⁶ of $\mathbb{R}^n$, where k = n − m. For proving Theorem 8.5.2, we need
to verify that L is good for every boundary point z of the crosspolytope
with $|\mathrm{supp}(z)| \le r$, where we say that L is good for z if $(L + z) \cap B_1^n = \{z\}$.

⁶ The question, “What is a random k-dimensional subspace?” is a subtle one. For
us, the simplest way out is to define a random k-dimensional subspace as the
kernel of a random m×n matrix with independent normal entries as in Theorem 8.5.2.
Fortunately, this turns out to be equivalent to the usual (and “right”)
definition, which is as follows. One fixes a particular k-dimensional subspace $R_0$,
say the span of the first k vectors of the standard basis, and defines a random
subspace as a random rotation of $R_0$. This may not sound like great progress,
since we have just used the equally problematic-looking notion of random rotation.
But the group SO(n) of all rotations in $\mathbb{R}^n$ around the origin is a compact
group and hence it has a unique invariant probability measure (Haar measure),
which defines “random rotation” satisfactorily and uniquely.

For z as above, let us define a convex cone $C_z = \{t(x - z) : t \ge 0,\ x \in B_1^n\}$.
Geometrically, we take the cone generated by all rays
emanating from z and intersecting the crosspolytope in a point other
than z, and we translate the cone so that z is moved to the origin.
Then L good for z means exactly that $L \cap C_z = \{0\}$.

The points z on the boundary with at most r nonzero coordinates
fill out exactly the union of all (r−1)-dimensional faces of the
crosspolytope. Let F be one of these faces. It can be checked that the
cone $C_z$ is the same for all z in the relative interior of F, so we can
define the cone $C_F$ associated with the face F (the reader may want
to consider some examples for the 3-dimensional regular octahedron).
Moreover, if y is a boundary point of F, then $C_y \subseteq C_F$, and so if L
is good for some point in the relative interior of F, then it is good for
all points of F including the boundary.

Let $p_F$ denote the probability that a random L is bad (i.e., not
good) for some $z \in F$. Then the probability that L is bad for any
z at all is no more than $\sum_F p_F$, where the sum is over all
(r−1)-dimensional faces of the crosspolytope.

It is not too difficult to see that the number of (r−1)-dimensional
faces is $\binom{n}{r}2^r$, and that the cones $C_F$ of all of these faces are congruent
(they differ only by rotation around the origin). Therefore, all $p_F$ equal
the same number p = p(n,k,r), and the probability of L bad for at
least one z is at most $\binom{n}{r}2^r p$. If we manage to show, for some values
of n, k, and r, that the expression $\binom{n}{r}2^r p$ is much smaller than 1,
then we can conclude that a random matrix A is BP-exact with high
probability.

Estimating p(n,k,r) is a nontrivial task; its difficulty heavily depends
on the accuracy we want to attain. Getting an estimate that is
more or less accurate including numerical constants, such as is needed
to prove Theorem 8.5.2, is quite challenging. On the other hand, if we
don’t care about numerical constants so much and want just a rough
asymptotic result, standard methods from high-dimensional convexity
theory lead to the goal much faster.

Here we conclude this very rough outline of the argument with
a few words on why one should expect p(n,k,r) to be very small
for k and r much smaller than n. Roughly speaking, this is because
for n large, the n-dimensional crosspolytope is a very lean and spiky
body, and the cones $C_F$ are very narrow for low-dimensional faces F.
Hence a random subspace L of not too large dimension is very likely to
avoid $C_F$. As a very simplified example of this phenomenon, we may
consider k = r = 1. Then F is a vertex and $C_F$ is easily described. As
a manageable exercise, the reader may try to estimate the fraction of
the unit sphere centered at 0 that is covered by $C_F$; this quantity is
exactly half of the probability p(n,1,1) that a random line through 0
intersects $C_F$ nontrivially.
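The number of (r−1)-dimensional faces used in the union bound of this outline can be sanity-checked by enumeration. The sketch below assumes the standard description of the faces of the crosspolytope, which is not proved in the text: each (r−1)-dimensional face is spanned by r of the 2n vertices $\pm e_i$, no two of them opposite, and so it corresponds to a sign pattern in $\{-1,0,1\}^n$ with exactly r nonzero entries.

```python
from itertools import product
from math import comb

def count_faces(n, r):
    """Count sign patterns in {-1, 0, 1}^n with exactly r nonzero entries.

    Under the assumed description of the faces, this is the number of
    (r-1)-dimensional faces of the n-dimensional crosspolytope.
    """
    return sum(1 for s in product((-1, 0, 1), repeat=n)
               if sum(c != 0 for c in s) == r)

for r in (1, 2, 3):
    print(r, count_faces(3, r), comb(3, r) * 2 ** r)
```

For the octahedron this gives 6 vertices, 12 edges, and 8 triangular faces.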


References. Basis pursuit was introduced in 


S. Chen, D. L. Donoho, and M. A. Saunders: Atomic decomposition
by basis pursuit, SIAM J. Scientific Computing 20,
1 (1999), 33-61.


A classical approach to finding a “good” solution of an underdetermined
system Ax = b would be to minimize $\|x\|_2$, rather than $\|x\|_1$
(a “least squares” or “generalized inverse” method), which typically
yields a solution with many nonzero components and is much less successful
in applications such as the decoding problem. Basis pursuit,
by minimizing the $\ell_1$-norm instead, yields a basic solution with only
a few nonzero components.
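The contrast between the two norms is visible already on a tiny system. In the sketch below (a hypothetical example of ours), the solutions of $x_1 - x_2 = -7$, $x_2 - x_3 = 7$ form the line (0, 7, 0) + t(1, 1, 1). The point of minimum Euclidean norm on a line x0 + t·d is obtained for $t = -(x_0 \cdot d)/(d \cdot d)$.

```python
from fractions import Fraction

def least_squares_on_line(x0, d):
    """Point of minimum Euclidean norm on the line {x0 + t*d}, for integer x0 and d != 0."""
    t = Fraction(-sum(a * b for a, b in zip(x0, d)), sum(b * b for b in d))
    return [a + t * b for a, b in zip(x0, d)]

x0, d = [0, 7, 0], [1, 1, 1]
x_ls = least_squares_on_line(x0, d)
print(x_ls)                                              # all three components are nonzero
print(sum(abs(c) for c in x_ls), sum(abs(c) for c in x0))  # l1-norms: 28/3 versus 7
```

The least-squares solution (−7/3, 14/3, −7/3) has no zero component and a larger ℓ₁-norm than the sparse solution (0, 7, 0).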

Interestingly, several groups of researchers independently arrived
at the concept of BP-exactness and obtained the following general
version of Theorem 8.5.2: For every constant $\alpha \in (0,1)$ there exists
$\beta = \beta(\alpha) > 0$ such that a random $\lfloor \alpha n \rfloor \times n$ matrix A is BP-exact for
$r = \lfloor \beta n \rfloor$ with probability exponentially close to 1. Combined results
of two of these groups can be found in


E. J. Candès, M. Rudelson, T. Tao, and R. Vershynin: Error
correction via linear programming, Proc. 46th IEEE Symposium
on Foundations of Computer Science (FOCS), 2005,
pages 295-308.


A third independent proof was given by Donoho. Later, and by yet another
method, he obtained the strongest known quantitative bounds,
including those in Theorem 8.5.2 (the previously mentioned proofs
yield a much smaller constant for α = 0.75 than 0.08). Among his
several papers on the subject we cite


D. Donoho: High-dimensional centrally symmetric polytopes
with neighborliness proportional to dimension, Discrete and
Computational Geometry 35 (2006), 617-652,

where he establishes a connection to the classical theory of convex
polytopes using a result in the spirit of Lemma 8.5.4 and the remarks
following it (but more elaborate). Through this connection he obtained
an interesting upper bound on β(α). For example, in the setting of
Theorem 8.5.2, there exists no $\lfloor 0.75n \rfloor \times n$ BP-exact matrix at all with
r > 0.25n (assuming n large). Additional upper bounds, essentially
showing the existence results for BP-matrices in the above papers to
be asymptotically optimal, were proved by


N. Linial and I. Novik: How neighborly can a centrally symmetric
polytope be?, Discr. Comput. Geom., in press.


We remark that our notation is a compromise among the notations 
of the papers quoted above and doesn’t follow any of them exactly, 
and that the term “BP-exact” is ours. 


More applications of sparse solutions of underdetermined
systems. The problem of computing a sparse solution of a system of
linear equations arises in signal processing. The signal considered may
be a recording of a sound, a measurement of seismic waves, a picture
taken by a digital camera, or any of a number of other things. A classical
method of analyzing signals is Fourier analysis, which from a
linear-algebraic point of view means expressing a given periodic function
in a basis consisting of the functions 1, cos x, sin x, cos 2x, sin 2x,
cos 3x, sin 3x, ... (the closely related cosine transform is used in the
JPEG encoding of digital pictures). These functions are linearly independent,
and so the expression (Fourier series) is unique. In the more
recent wavelet analysis⁷ one typically has a larger collection of basic
functions, the wavelets, which can be of various kinds, depending on
the particular application. They are no longer linearly independent,
and hence there are many different ways of representing a given signal
as a linear combination of wavelets. So one looks for a representation
satisfying some additional criteria, and sparsity (small number of
nonzero coefficients) is a very natural criterion: It leads to an economic
(compressed) representation, and sometimes it may also help in analyzing
or filtering the signal. For example, let us imagine that there is
a smooth signal that has a nice representation by sine and cosine functions,
and then an impulsive noise made of “spike” functions is added
to it. We let the basic functions be sines and cosines and suitable spike
functions, and by computing a sparse representation in such a basis
we can often isolate the noise component very well, something that
the classical Fourier analysis cannot do. Thus, we naturally arrive at
computing sparse solutions of underdetermined linear systems.

⁷ Indeed, the newer picture encoding standard JPEG 2000 employs wavelets.

Another source is computer tomography (CT), where one has an
unknown vector x (each $x_i$ is the density of some small area of tissue,
say), and the CT scanner measures various linear combinations of
the $x_i$, corresponding to rays through the tissue in various directions.
Sometimes there are reasons to expect that only a small number of
the pixels will have values $x_i$ different from the background level, and
when we want to reconstruct x from the scan, we again ask for a
sparse solution of a linear system. (More realistically, although less
intuitively, we don’t expect a small number of nonzero pixels, but
rather a small number of significantly nonzero coefficients in a suitable
wavelet representation.)


8.6 Transversals of d-Intervals 


This section describes an application of the duality theorem in discrete geom- 
etry and combinatorics. We begin with a particular geometric result. Then 
we discuss concepts appearing in the proof in a more general context. 


Helly’s and Gallai’s theorems. First let $\mathcal{I} = \{I_1, I_2, \ldots, I_n\}$ be a family of
closed intervals on the real line such that every two of the intervals intersect.
It is easily seen that there exists a point common to all of the intervals in $\mathcal{I}$:
Indeed, the rightmost among the left endpoints of the $I_i$ is such a point.
In more detail, writing $I_i = [a_i, b_i]$ and setting $a = \max\{a_1, \ldots, a_n\}$, we
necessarily have $a_i \le a \le b_i$ for all $i$, since $a_i \le a$ is immediate from the
definition of $a$, and if we had $a > b_i$ for some $i$, then $I_i = [a_i, b_i]$ would be
disjoint from the interval beginning with $a$.
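The one-dimensional argument is easy to check mechanically. The following Python sketch returns the rightmost left endpoint and verifies that it lies in every interval; the function name `common_point` and the sample intervals are our own illustrative choices, not part of the text.

```python
def common_point(intervals):
    """Return a point common to all closed intervals (a_i, b_i),
    assuming every two of them intersect."""
    a = max(left for left, _ in intervals)   # rightmost left endpoint
    # If a > b_i held for some i, then [a_i, b_i] would miss the interval
    # that begins at a, contradicting pairwise intersection.
    if not all(left <= a <= right for left, right in intervals):
        raise ValueError("some two intervals are disjoint")
    return a

print(common_point([(0, 5), (2, 7), (4, 9), (1, 4)]))
```

For the four sample intervals the rightmost left endpoint is 4, which indeed lies in each of them.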

The statement just proved is a special (one-dimensional) case of a beautiful
and important theorem of Helly: If $C_1, C_2, \ldots, C_n$ are convex sets in $\mathbb{R}^d$
such that any at most $d+1$ of them have a point in common, then there is a
point common to all of the $C_i$. We will not prove this result; it is mentioned
as a background against which the forthcoming results can be better appre- 
ciated. 

It is easily seen that in general we cannot replace d+ 1 by any smaller 
number in Helly’s theorem. For example, in the plane, the assumption of 
Helly’s theorem requires every three of the sets to have a common point, 
and pairwise intersections are not enough. To see this, we consider n lines in 
general position. They are convex sets, every two of them intersect, but no 
three have a common point. 

Let us now consider planar convex sets of a special kind, namely, cir- 
cular disks. One can easily draw three disks such that every two inter- 
sect, but no point is common to all three. However, there is a theorem 
(Gallai’s) for pairwise intersecting disks in the spirit of Helly’s theorem: If
$\mathcal{D} = \{D_1, D_2, \ldots, D_n\}$ is a family of disks in the plane in which every two
disks intersect, then there exist 4 points such that each $D_i$ contains at least
one of them. With the (best possible) constant 4 this is a quite difficult
theorem, but it is not too hard to prove a similar theorem with 4 replaced by
some large constant. The reader is invited to solve this as a puzzle.

A set of points as in the theorem that intersects every member of D is 
called a transversal of D (sometimes one also speaks of piercing or stab- 
bing D by a small number of points). Thus pairwise intersecting disks in 
the plane always have a 4-point transversal, and Helly’s theorem asserts that 
$(d+1)$-wise intersecting convex sets in $\mathbb{R}^d$ have a one-point transversal.

What conditions on a family of sets guarantee that it has a small 
transversal? This fairly general question subsumes many interesting particu- 
lar problems and it has been the subject of much research. Here we consider 
a one-dimensional situation. At the beginning of the section we dealt with a 
family of intervals, and now we admit intervals with some bounded number 
of “holes.” 


Transversals for pairwise intersecting d-intervals. For an integer
$d \ge 1$, a d-interval is defined as the union of $d$ closed intervals on the
real line. The following picture shows three pairwise intersecting 2-intervals 
(drawn solid, dashed, and dash-dotted, respectively) with no point common 
to all three: 


Thus, we cannot expect a one-point transversal for pairwise intersecting d- 
intervals. But the following theorem shows the existence of a transversal 
whose size depends only on d: 


8.6.1 Theorem. Let $\mathcal{J}$ be a finite family of d-intervals such that $J_1 \cap J_2 \neq \emptyset$
for every $J_1, J_2 \in \mathcal{J}$. Then $\mathcal{J}$ has a transversal of size $2d^2$; that is, there exist
$2d^2$ points such that each d-interval of $\mathcal{J}$ contains at least one of them.


At first sight it is not obvious that there is any bound at all for the 
size of the transversal that depends only on d. This was first proved 
in 1970, with a bound exponential in d, in 


A. Gyárfás, J. Lehel: A Helly-type problem in trees, in Combinatorial
Theory and its Applications, P. Erdős, A. Rényi, and
V. T. Sós, editors, North-Holland, Amsterdam, 1970, pages
571-584.


The best bound known at present is $d^2$, and it has been established
using algebraic topology in

T. Kaiser: Transversals of d-intervals, Discrete Comput.
Geom. 18(1997) 195-203.


We are going to prove a bound that is worse by a factor of 2, but the 
proof presented here is much simpler. It comes from 
N. Alon: Piercing d-intervals, Discrete Comput. Geom.
19(1998) 333-334, 


following a general method developed in 


N. Alon, D. Kleitman: Piercing convex sets and the Hadwiger–Debrunner
(p,q)-problem, Adv. Math. 96(1992) 103-112.


It is also known that the bound cannot be improved below a constant
multiple of $d^2/\log d$; see


J. Matoušek: Lower bounds on the transversal numbers of d-
intervals, Discrete Comput. Geom. 26(2001) 283-287. 


Working toward a proof of the theorem, we first show that in a family of 
pairwise intersecting d-intervals, some point has to be contained in “many” 
members of the family. For reasons that will become apparent later, we for- 
mulate it for a finite sequence of d-intervals, so that repetitions are allowed. 


8.6.2 Lemma. Let $J_1, J_2, \ldots, J_n$ be d-intervals such that $J_i \cap J_j \neq \emptyset$ for all
$i, j \in \{1, 2, \ldots, n\}$. Then there is an endpoint of some $J_i$ that is contained in
at least $n/2d$ of the d-intervals.


Proof. Let $T$ be the set of all ordered triples $(p, i, j)$ such that $p$ is one of
the at most $d$ left endpoints of $J_i$, and $p \in J_j$. We want to bound the size
of $T$ from below. Let us fix $i < j$ for a moment. Let $p$ be the leftmost point of
the (nonempty) intersection $J_i \cap J_j$. Clearly, $p$ is a left endpoint of $J_i$ or $J_j$,
and thus $T$ contains one of the triples $(p, i, j)$ or $(p, j, i)$. So every pair $i, j$,
$i < j$, contributes at least one member of $T$. Moreover, every $i$ contributes a
triple $(p, i, i)$, where $p$ is any left endpoint of $J_i$. Thus
$|T| \ge \binom{n}{2} + n \ge n^2/2$.
Since there are at most $dn$ possible values for the pair $(p, i)$, it follows that
there exist $p_0$ and $i_0$ such that $(p_0, i_0, j) \in T$ for at least $(n^2/2)/dn = n/2d$
values of $j$. This means that $p_0$ is contained in at least $n/2d$ of the $J_j$.
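To make the lemma concrete, here is a small Python sketch that finds a left endpoint of maximum depth in a family of pairwise intersecting d-intervals and compares its depth with the guaranteed $n/2d$. The representation of a d-interval as a list of pairs, the function names, and the sample family (three pairwise intersecting 2-intervals with no common point) are assumptions made for illustration.

```python
from itertools import combinations

def intersects(J1, J2):
    """Two d-intervals (lists of closed intervals (a, b)) share a point."""
    return any(max(a1, a2) <= min(b1, b2) for a1, b1 in J1 for a2, b2 in J2)

def depth(p, family):
    """Number of d-intervals of the sequence that contain the point p."""
    return sum(any(a <= p <= b for a, b in J) for J in family)

def deepest_left_endpoint(family):
    """Return (point, depth) for a left endpoint of maximum depth."""
    endpoints = [a for J in family for a, _ in J]
    best = max(endpoints, key=lambda p: depth(p, family))
    return best, depth(best, family)

family = [[(0, 2), (8, 10)], [(1, 5), (12, 13)], [(4, 6), (9, 11)]]
d, n = 2, len(family)
assert all(intersects(J1, J2) for J1, J2 in combinations(family, 2))
point, k = deepest_left_endpoint(family)
assert 2 * d * k >= n          # the guarantee of Lemma 8.6.2: k >= n/2d
print(point, k)
```

In this tiny example the bound $n/2d = 3/4$ is weak; the deepest left endpoint actually lies in two of the three 2-intervals.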


Next, we show that for every family of pairwise intersecting d-intervals 
we can distribute weights on the endpoints such that every d-interval has 
relatively large weight (compared to the total weight of all points). 


8.6.3 Lemma. Let $\mathcal{J}$ be a finite family of pairwise intersecting d-intervals,
and let $P$ denote the set of endpoints of the d-intervals in $\mathcal{J}$. Then there are
nonnegative real numbers $x_p$, $p \in P$, such that

$$\sum_{p \in J \cap P} x_p \ge 1 \quad \text{for every } J \in \mathcal{J}, \text{ and}$$

$$\sum_{p \in P} x_p \le 2d.$$

Before proving the lemma, let us see how it implies Theorem 8.6.1. Given
$\mathcal{J}$ and the weights as in the lemma, we will choose a set $X \subseteq P$ such that
$|X| \le 2d^2$, and each of the $|X| + 1$ open intervals into which $X$ divides the
real axis contains points of $P$ of total weight less than $1/d$. Such an $X$ can
be selected from $P$ by a simple left-to-right scan: Let $p_1 < p_2 < \cdots < p_m$ be
the points of $P$. We consider them one by one in this order, and we include
$p_i$ into $X$ whenever $\sum_{p \in P:\, p_\ell < p \le p_i} x_p \ge 1/d$, where $p_\ell$ is the last point already
included in $X$ (and we formally put $p_\ell = -\infty$ if no point has yet been included
in $X$). It is clear that none of the open intervals determined by $X$ contains
weight $1/d$ or larger, and the bound on the size of $X$ follows easily, using that
the total weight of all points of $P$ is at most $2d$.

We claim that $X$ is a transversal of $\mathcal{J}$. Indeed, considering a $J \in \mathcal{J}$, at
least one of the $d$ components of $J$ is an interval containing points of $P$ of
total weight at least $1/d$, and so it contains a point of $X$.
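The left-to-right scan can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: weights are passed as a dictionary from points to exact fractions (points of weight 0 may be omitted, since they never trigger a selection), and the sample family and weights are hypothetical.

```python
from fractions import Fraction

def scan_transversal(weights, d):
    """Left-to-right scan: `weights` maps each point p of P to its weight x_p.
    Returns the chosen points X in increasing order."""
    X = []
    acc = Fraction(0)              # weight accumulated after the last chosen point
    for p in sorted(weights):
        acc += weights[p]
        if acc >= Fraction(1, d):  # accumulated weight reached 1/d: choose p
            X.append(p)
            acc = Fraction(0)
    return X

def is_transversal(X, family):
    """Every d-interval (a list of closed intervals) contains a point of X."""
    return all(any(a <= p <= b for p in X for a, b in J) for J in family)

family = [[(0, 2), (8, 10)], [(1, 5), (12, 13)], [(4, 6), (9, 11)]]
half = Fraction(1, 2)
weights = {1: half, 4: half, 9: half}   # every 2-interval receives weight 1
X = scan_transversal(weights, d=2)
print(X, is_transversal(X, family))
```

Each chosen point closes off a stretch of weight at least $1/d$, so a total weight of at most $2d$ allows at most $2d^2$ chosen points.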


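Proof of Lemma 8.6.3 by duality. We formulate the problem of choosing
the weights $x_p$ as a linear program with variables $x_p$, $p \in P$:

$$\begin{array}{lll}
\text{Minimize} & \sum_{p \in P} x_p & \\
\text{subject to} & \sum_{p \in J \cap P} x_p \ge 1 & \text{for every } J \in \mathcal{J}, \\
 & \mathbf{x} \ge \mathbf{0}. &
\end{array}$$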


The linear program is certainly feasible; for instance, $x_p = 1$ for all $p$ is a
feasible solution. We would like to show that the optimum is at most 2d. 

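Using the dualization recipe from Section 6.2, we find that the dual linear
program has variables $y_J$, $J \in \mathcal{J}$, and it looks as follows:

$$\begin{array}{lll}
\text{Maximize} & \sum_{J \in \mathcal{J}} y_J & \\
\text{subject to} & \sum_{J \in \mathcal{J}:\, p \in J} y_J \le 1 & \text{for every } p \in P, \\
 & \mathbf{y} \ge \mathbf{0}. &
\end{array}$$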


The dual linear program is feasible, too, since y = 0 is feasible. Then by the 
duality theorem both the primal and the dual linear programs have optimal 
solutions, x* and y*, respectively, that yield the same value of the objective 
functions. 
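As a sanity check on the two linear programs, the following Python sketch verifies a pair of feasible solutions for a small hypothetical family of 2-intervals. Since the primal and dual objective values coincide, weak duality certifies that both solutions are optimal. The family and the chosen weights are illustrative assumptions, not taken from the text.

```python
from fractions import Fraction

family = [[(0, 2), (8, 10)], [(1, 5), (12, 13)], [(4, 6), (9, 11)]]
P = sorted({e for J in family for ab in J for e in ab})   # all endpoints

def contains(J, p):
    """The d-interval J (a list of closed intervals) contains the point p."""
    return any(a <= p <= b for a, b in J)

half = Fraction(1, 2)
x = {p: (half if p in (1, 4, 9) else Fraction(0)) for p in P}  # primal candidate
y = [half] * len(family)                                       # dual candidate

primal_ok = all(sum(x[p] for p in P if contains(J, p)) >= 1 for J in family)
dual_ok = all(sum(yJ for J, yJ in zip(family, y) if contains(J, p)) <= 1
              for p in P)
print(primal_ok, dual_ok, sum(x.values()), sum(y))
```

Both candidates have value 3/2, well below the bound $2d = 4$ that the lemma guarantees.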

We may assume that y* is rational: Indeed, we may take it to be a basic 
feasible solution, and all basic feasible solutions are rational since all coeffi- 
cients of the linear program are rational. 

We now have some rational weight $y_J^*$ for every $J \in \mathcal{J}$ such that no point
of $P$ is contained in d-intervals of total weight exceeding 1, and we want to
show that the total weight of all $J \in \mathcal{J}$ cannot be larger than $2d$.

Lemma 8.6.2 tells us that if all the d-intervals had the same weight, then 
there would be a point contained in d-intervals of weight at least W/2d, where 
W is the total weight of all d-intervals. Our weights need not all be equal, 
but we will pass to the case of equal weights by replicating each d-interval a 
suitable number of times. 

Let $D$ be a common denominator of all the rational numbers $y_J^*$, $J \in \mathcal{J}$,
and let $y_J^* = r_J / D$, with $r_J$ integral. Let $(J_1, J_2, \ldots, J_n)$ be a sequence that
includes each d-interval $J \in \mathcal{J}$ exactly $r_J$ times (thus $n = \sum_{J \in \mathcal{J}} r_J$). By
Lemma 8.6.2 there is a point $p \in P$ contained in at least $n/2d$ members of
the sequence, which means that

$$\sum_{J \in \mathcal{J}:\, p \in J} r_J \;\ge\; \frac{n}{2d} \;=\; \frac{1}{2d} \sum_{J \in \mathcal{J}} r_J.$$

Dividing both sides by the common denominator $D$ and multiplying by $2d$
gives

$$2d \cdot \sum_{J \in \mathcal{J}:\, p \in J} y_J^* \;\ge\; \sum_{J \in \mathcal{J}} y_J^*.$$

Since $\mathbf{y}^*$ is a feasible solution of the dual linear program, the left-hand side is
at most $2d$, and so $\sum_{J \in \mathcal{J}} y_J^* \le 2d$. This concludes the proof of Lemma 8.6.3
as well as of Theorem 8.6.1.


Transversal number and matching number. Let us now look
at some of the concepts appearing in the proof of Theorem 8.6.1 in
a general context. Let $V$ be an arbitrary finite set, and let $\mathcal{F}$ be a
system of subsets of $V$.

A set $X \subseteq V$ is called a transversal of $\mathcal{F}$ if $F \cap X \neq \emptyset$ for every
$F \in \mathcal{F}$. The transversal number $\tau(\mathcal{F})$ is the minimum possible
number of elements of a transversal of $\mathcal{F}$.

Determining or estimating the transversal number of a given set
system is an important basic problem in combinatorics and combinatorial
optimization, including many other problems as special cases.
For example, if we consider a graph $G = (V, E)$, and view the edges
as two-element subsets of $V$, then a transversal of $E$ is exactly what
was called a vertex cover in Section 3.3.

Another useful notion is the matching number of $\mathcal{F}$, denoted
by $\nu(\mathcal{F})$ and defined as the maximum number of sets in a subsystem
$\mathcal{M} \subseteq \mathcal{F}$ such that no two distinct sets $F, F' \in \mathcal{M}$ intersect (such an
$\mathcal{M}$ is called a matching).

It is easily seen that $\nu(\mathcal{F}) \le \tau(\mathcal{F})$ for every $\mathcal{F}$: If $\mathcal{F}$ has matching
number $k$, then $\mathcal{F}$ contains $k$ pairwise disjoint sets; thus, we need at
least $k$ points to get a transversal of $\mathcal{F}$.

For the graph example, $\nu(\mathcal{F})$ is exactly the number of edges in a
maximum matching. This is also where the name “matching number” 
comes from. 


The condition in Theorem 8.6.1 that every two d-intervals in $\mathcal{J}$
intersect can be rephrased as $\nu(\mathcal{J}) = 1$. More generally, $\nu(\mathcal{F}) \le k$
means that among every $k + 1$ members of $\mathcal{F}$ we can find two that
intersect. There is a more general version of Theorem 8.6.1 stating
that $\tau(\mathcal{J}) \le 2d^2 \nu(\mathcal{J})$ for every finite family of d-intervals. The proof
is very similar to the one shown for Theorem 8.6.1, except that the
analogue of Lemma 8.6.2 needs the well-known Turán’s theorem from
graph theory.

Fractional transversals and matchings. In the proof of Theorem 8.6.1
we have implicitly used another parameter of a set system,
which always lies between $\nu(\mathcal{F})$ and $\tau(\mathcal{F})$ and which, unlike $\tau(\mathcal{F})$ and
$\nu(\mathcal{F})$, is efficiently computable. This new parameter can be introduced
in two seemingly different ways, which turn out to be equivalent by
the duality theorem of linear programming.

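A fractional transversal of $\mathcal{F}$ is any feasible solution $\mathbf{x}$ to the
linear program

$$\begin{array}{lll}
\text{minimize} & \sum_{v \in V} x_v & \\
\text{subject to} & \sum_{v \in F} x_v \ge 1 & \text{for every } F \in \mathcal{F}, \\
 & \mathbf{x} \ge \mathbf{0}. &
\end{array}$$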


So in a fractional transversal one can take, say, one-third of one point 
and two-thirds of another, but for each set, the fractions for points 
in that set must add up to at least 1, one full point. The fractional
transversal number $\tau^*(\mathcal{F})$ is the optimal value of the objective function,
i.e., the minimum possible total weight of a fractional transversal.

Every transversal $T$ corresponds to a fractional transversal, given
by $x_v = 1$ if $v \in T$ and $x_v = 0$ otherwise, and thus $\tau^*(\mathcal{F}) \le \tau(\mathcal{F})$ for
every $\mathcal{F}$.

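A fractional matching for $\mathcal{F}$ is any feasible solution $\mathbf{y}$ to the
linear program

$$\begin{array}{lll}
\text{maximize} & \sum_{F \in \mathcal{F}} y_F & \\
\text{subject to} & \sum_{F \in \mathcal{F}:\, v \in F} y_F \le 1 & \text{for every } v \in V, \\
 & \mathbf{y} \ge \mathbf{0}, &
\end{array}$$

and the optimal value of the objective function is the fractional matching
number $\nu^*(\mathcal{F})$.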

Every matching $\mathcal{M}$ yields a fractional matching (we put $y_F = 1$
for $F \in \mathcal{M}$ and $y_F = 0$ otherwise). Thus, $\nu(\mathcal{F}) \le \nu^*(\mathcal{F})$.

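Since the linear programs for $\tau^*$ and for $\nu^*$ are dual to each other,
we always have $\nu^*(\mathcal{F}) = \tau^*(\mathcal{F})$, and altogether we have the chain of
inequalities

$$\nu(\mathcal{F}) \le \nu^*(\mathcal{F}) = \tau^*(\mathcal{F}) \le \tau(\mathcal{F}).$$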

We remark that if $\mathcal{F}$ is the set of edges of a bipartite graph, then
König’s theorem (Theorem 8.2.2) asserts exactly that $\nu(\mathcal{F}) = \tau(\mathcal{F})$.
On the other hand, if $\mathcal{F}$ is the edge set of a triangle (that is, $\mathcal{F} =
\{\{1,2\}, \{1,3\}, \{2,3\}\}$), then $\nu(\mathcal{F}) = 1 < \nu^*(\mathcal{F}) = \frac{3}{2} < \tau(\mathcal{F}) = 2$.
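The triangle example can be checked by brute force. The Python sketch below computes the matching and transversal numbers by enumeration and verifies a fractional transversal and a fractional matching of equal value 3/2, which are therefore both optimal. The function names are our own, and the enumeration is only meant for tiny set systems.

```python
from fractions import Fraction
from itertools import combinations

V = [1, 2, 3]
F = [frozenset(e) for e in ({1, 2}, {1, 3}, {2, 3})]

def transversal_number(V, F):
    """Smallest size of a set X meeting every member of F (brute force)."""
    for k in range(len(V) + 1):
        for X in combinations(V, k):
            if all(S & set(X) for S in F):
                return k

def matching_number(F):
    """Largest number of pairwise disjoint members of F (brute force)."""
    for k in range(len(F), 0, -1):
        for M in combinations(F, k):
            if all(not (A & B) for A, B in combinations(M, 2)):
                return k
    return 0

half = Fraction(1, 2)
x = {v: half for v in V}      # candidate fractional transversal
y = {S: half for S in F}      # candidate fractional matching
assert all(sum(x[v] for v in S) >= 1 for S in F)            # x is feasible
assert all(sum(y[S] for S in F if v in S) <= 1 for v in V)  # y is feasible
print(matching_number(F), sum(y.values()), sum(x.values()),
      transversal_number(V, F))
```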

The proof of Theorem 8.6.1 can now be viewed as follows: First one
proves that $\nu^*(\mathcal{J}) \le 2d$ for every family of d-intervals with $\nu(\mathcal{J}) = 1$,
and then one shows that $\tau(\mathcal{J}) \le d \cdot \tau^*(\mathcal{J})$. This proof scheme turned
out to be very powerful and it works in many other cases as well.


Fractional concepts. Besides the fractional matching and transver- 
sal numbers, many other “fractional” quantities appear in combina- 
torics and combinatorial optimization. The general recipe is to take 
some useful integer-valued parameter Q of a graph, say, reformulate 
its definition as an integer program, and introduce a “fractional Q” as 
the optimum value of a suitable LP relaxation of the integer program. 
In many cases such a fractional Q is useful for studying or approxi- 
mating the original Q. The book 


E.R. Scheinerman and D.H. Ullman: Fractional Graph The- 
ory, John Wiley & Sons, New York 1997, 


nicely presents such developments. Let us conclude this section by 
quoting an example from that book. We consider five committees, 
numbered 1,2,...,5, such that 1 and 2 have a common member, and 
so have 2 and 3, 3 and 4, 4 and 5, and 5 and 1, while any other pair 
of committees is disjoint. A one-hour meeting should be scheduled for 
each committee, and meetings of committees with a common member 
must not overlap. What is the length of the shortest time interval in 
which all the five meetings can be scheduled? 

It seems that a 3-hour schedule like the one below should be opti- 
mal: 


[Figure: a 3-hour schedule of the five meetings, from 12:00 to 15:00]


However, if one of the committees is willing to break its meeting into 
two half-hour parts, then a shorter schedule is possible: 


[Figure: a 2.5-hour schedule of the five meetings, from 12:00 to 14:30]


The first schedule corresponds to a proper coloring of the conflict graph
(a cycle on the five committees 1, 2, ..., 5)
by three colors, while the second schedule corresponds to a fractional
coloring of the same graph, with value 2.5.
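The committee example can be verified with a short Python sketch. It computes the chromatic number of the conflict graph by brute force and checks one particular 2.5-hour schedule made of five half-hour slots. The slot assignment is our own choice for illustration (in it, only committee 1 splits its meeting); the schedule pictured in the text may differ.

```python
from fractions import Fraction
from itertools import product

committees = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]   # pairs sharing a member

def chromatic_number(vertices, edges):
    """Smallest k admitting a proper k-coloring (brute force)."""
    for k in range(1, len(vertices) + 1):
        for colors in product(range(k), repeat=len(vertices)):
            c = dict(zip(vertices, colors))
            if all(c[u] != c[v] for u, v in edges):
                return k

# Five consecutive half-hour slots starting at 12:00; each slot hosts two
# committees without a common member.
slots = [{1, 3}, {3, 5}, {5, 2}, {2, 4}, {4, 1}]
slot_length = Fraction(1, 2)

def schedule_is_valid(slots, edges, vertices):
    """No conflicting pair shares a slot, and everyone meets for one hour."""
    no_conflict = all(not (u in s and v in s) for s in slots for u, v in edges)
    full_hour = all(sum(slot_length for s in slots if v in s) == 1
                    for v in vertices)
    return no_conflict and full_hour

total_time = slot_length * len(slots)
print(chromatic_number(committees, edges),
      schedule_is_valid(slots, edges, committees), total_time)
```

No schedule can be shorter than 2.5 hours: at most two committees can meet simultaneously, and five committee-hours have to be accommodated.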


8.7 Smallest Balls and Convex Programming 


The smallest ball problem. We are given points $\mathbf{p}_1, \ldots, \mathbf{p}_n \in \mathbb{R}^d$, and we
want to find a ball of the smallest radius that contains all the points.⁸


This looks similar to some of the geometric optimization problems that we 
have addressed in Chapter 2, such as the problem of placing a largest possible 
disk inside a convex polygon. For the smallest ball problem, however, there 
is no simple trick that lets us write the problem as a linear program. 

We will see that it can be formulated as a convex quadratic program, 
which is in many respects the next best thing to a linear program. There 
are efficient solvers for convex quadratic programs, based on interior point 
methods or on simplex-type methods, and so this formulation can be used 
for computing a smallest enclosing ball in practice. 

We will also derive from this formulation that the smallest enclosing ball
always exists, and it is determined uniquely. (This is intuitively very plausible;
think of a shrinking rubber ball.) In the course of the proof, we will establish
a powerful criterion for optimality of a feasible solution of a convex program,
known (in a much more general context) as the Karush–Kuhn–Tucker conditions.
These conditions are of outstanding theoretical value, and they are the
basis of efficient solution methods for many classes of optimization problems,
including convex quadratic programming. The reader might still wonder, how
does all of this relate to linear programming? We will use the duality theorem
of linear programming to derive the Karush–Kuhn–Tucker conditions.

8 In the plane this is sometimes referred to as the smallest bomb problem, but we
prefer not to elaborate this association into a real-life story.

We begin by introducing convex programming, and we will return to the 
smallest enclosing ball problem later. 

A short introduction to convex programming. Let us recall that an
$n$-variate function $f\colon \mathbb{R}^n \to \mathbb{R}$ is convex if

$$f((1-t)\mathbf{x} + t\mathbf{y}) \le (1-t) f(\mathbf{x}) + t f(\mathbf{y})$$

for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ and all $t \in [0, 1]$. Geometrically, the segment connecting the
points $(\mathbf{x}, f(\mathbf{x}))$ and $(\mathbf{y}, f(\mathbf{y}))$ in $\mathbb{R}^{n+1}$ never goes below the graph of $f$.
A convex program is the problem of minimizing a convex function in $n$
variables subject to linear equality and inequality constraints.⁹
The following picture illustrates a 2-dimensional convex programming 
problem, with a planar feasible region given by four inequality constraints: 


[Figure: a planar feasible region bounded by four inequality constraints]


We note that the minimum need not occur at a vertex of the feasible region.
Moreover, an optimal solution need not exist even if the convex function $f(\mathbf{x})$
is bounded from below; an example is the problem of minimizing $e^{-x}$ subject
to $x \ge 0$. We should also remark that, unlike in linear programming,
we cannot change minimization to maximization, since for $f$ convex, $-f$ is
typically not convex (unless $f$ is linear). Actually, maximizing a convex function
subject to linear constraints is a computationally difficult (NP-hard)
problem.


9 Some sources allow other types of convex constraints in a convex program, but
we don’t need this here.

Here we will consider convex programs in equational form:


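$$\begin{array}{ll}
\text{Minimize} & f(\mathbf{x}) \\
\text{subject to} & A\mathbf{x} = \mathbf{b} \\
 & \mathbf{x} \ge \mathbf{0},
\end{array} \tag{8.11}$$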


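where $A \in \mathbb{R}^{m \times n}$, $\mathbf{b} \in \mathbb{R}^m$, and $f\colon \mathbb{R}^n \to \mathbb{R}$ is a convex function.
In order to use calculus, we also assume that $f$ is differentiable, with
continuous partial derivatives. In this situation, the inequality

$$f(\mathbf{x}) \ge f(\mathbf{z}) + \nabla f(\mathbf{z})(\mathbf{x} - \mathbf{z}) \tag{8.12}$$

holds for all $\mathbf{x}, \mathbf{z} \in \mathbb{R}^n$, and this is an alternative characterization of convexity.
We recall that

$$\nabla f(\mathbf{z}) = \left( \frac{\partial f}{\partial x_1}(\mathbf{z}), \ldots, \frac{\partial f}{\partial x_n}(\mathbf{z}) \right)$$

is the gradient (vector of partial derivatives) of $f$ at $\mathbf{z}$. Thus $\nabla f(\mathbf{z})(\mathbf{x} - \mathbf{z})$ is
the scalar product of the row vector $\nabla f(\mathbf{z})$ with the column vector $\mathbf{x} - \mathbf{z}$.

Geometrically, the inequality says that the epigraph of $f$ lies above all of
its tangential hyperplanes. This has the following easy consequence.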


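8.7.1 Fact. Let $C \subseteq \mathbb{R}^n$ be a convex set and $f\colon \mathbb{R}^n \to \mathbb{R}$ a differentiable
convex function. A vector $\mathbf{x}^*$ minimizes $f(\mathbf{x})$ over $C$ if and only if

$$\nabla f(\mathbf{x}^*)(\mathbf{x} - \mathbf{x}^*) \ge 0 \quad \text{for all } \mathbf{x} \in C.$$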


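Proof. First we prove that the inequality implies optimality of $\mathbf{x}^*$. Using
(8.12), $\nabla f(\mathbf{x}^*)(\mathbf{x} - \mathbf{x}^*) \ge 0$ implies $f(\mathbf{x}) \ge f(\mathbf{x}^*)$, so if the former holds for
all $\mathbf{x} \in C$, $\mathbf{x}^*$ is a minimizer of $f$ over $C$.

For the other direction, let us assume that $\mathbf{x}^*$ is a minimizer and that
$\mathbf{x} \in C$. We consider the convex combination

$$\mathbf{x}(t) := \mathbf{x}^* + t(\mathbf{x} - \mathbf{x}^*) \in C, \quad t \in [0, 1],$$

and we observe that

$$\frac{\partial}{\partial t} f(\mathbf{x}(t)) \Big|_{t=0} = \lim_{t \to 0} \frac{f(\mathbf{x}(t)) - f(\mathbf{x}^*)}{t} \ge 0$$

must hold, for otherwise, $f(\mathbf{x}(t)) < f(\mathbf{x}^*)$ for some small $t$. On the other
hand, we have

$$\frac{\partial}{\partial t} f(\mathbf{x}(t)) \Big|_{t=0} = \nabla f(\mathbf{x}^*)(\mathbf{x} - \mathbf{x}^*),$$

by the chain rule. This completes the proof.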
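Fact 8.7.1 can be illustrated numerically. The sketch below uses a hypothetical example of our own: the convex function $f(\mathbf{x}) = (x_1 - 2)^2 + (x_2 - 2)^2$ over the unit square $C = [0,1]^2$, whose minimizer is the corner $(1,1)$. The criterion is tested only on a finite grid of sample points, so this is an illustration rather than a proof.

```python
from fractions import Fraction
from itertools import product

def grad_f(x):
    """Gradient of f(x) = (x1 - 2)^2 + (x2 - 2)^2."""
    return (2 * (x[0] - 2), 2 * (x[1] - 2))

def satisfies_criterion(x_star, samples):
    """Check grad f(x*) (x - x*) >= 0 for all sample points x of C."""
    g = grad_f(x_star)
    return all(sum(gi * (xi - si) for gi, xi, si in zip(g, x, x_star)) >= 0
               for x in samples)

# Sample the unit square C = [0, 1]^2 on an 11 x 11 grid.
grid = [(Fraction(i, 10), Fraction(j, 10))
        for i, j in product(range(11), repeat=2)]

print(satisfies_criterion((1, 1), grid))
print(satisfies_criterion((Fraction(1, 2), Fraction(1, 2)), grid))
```

The corner $(1,1)$ passes the test, while the center of the square fails it: moving from the center toward $(1,1)$ decreases $f$.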


Next we formulate and prove the promised optimality criterion for convex 
programming. 

8.7.2 Proposition (Karush–Kuhn–Tucker conditions). Let us consider
the convex program

$$\begin{array}{ll}
\text{minimize} & f(\mathbf{x}) \\
\text{subject to} & A\mathbf{x} = \mathbf{b} \\
 & \mathbf{x} \ge \mathbf{0}
\end{array}$$

with $f$ convex and differentiable, with continuous partial derivatives. A feasible
solution $\mathbf{x}^* \in \mathbb{R}^n$ is optimal if and only if there is a vector $\mathbf{y} \in \mathbb{R}^m$ such
that for all $j \in \{1, \ldots, n\}$,

$$\nabla f(\mathbf{x}^*)_j + \mathbf{y}^T \mathbf{a}_j \;\begin{cases} = 0 & \text{if } x_j^* > 0 \\ \ge 0 & \text{otherwise.} \end{cases}$$

Here $\mathbf{a}_j$ is the $j$th column of $A$.
The components of $\mathbf{y}$ are called the Karush–Kuhn–Tucker multipliers.


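Proof. First we assume that there is a vector $\mathbf{y}$ with the above properties,
and we let $\mathbf{x}$ be any feasible solution to the convex program. Then we get

$$(\nabla f(\mathbf{x}^*) + \mathbf{y}^T A)\,\mathbf{x}^* = 0,$$
$$(\nabla f(\mathbf{x}^*) + \mathbf{y}^T A)\,\mathbf{x} \ge 0.$$

(The first line holds because the $j$th component of $\nabla f(\mathbf{x}^*) + \mathbf{y}^T A$ vanishes
whenever $x_j^* > 0$, and the second because this row vector and $\mathbf{x}$ are both
nonnegative.) Subtracting the first equation from the second, the contributions of $\mathbf{y}^T A$
cancel (since $A\mathbf{x} = \mathbf{b} = A\mathbf{x}^*$ by the feasibility of $\mathbf{x}$ and $\mathbf{x}^*$), and we conclude
that

$$\nabla f(\mathbf{x}^*)(\mathbf{x} - \mathbf{x}^*) \ge 0.$$

Since this holds for all feasible solutions $\mathbf{x}$, the solution $\mathbf{x}^*$ is optimal by
Fact 8.7.1.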

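Conversely, let $\mathbf{x}^*$ be optimal, and let us set $\mathbf{c} = -\nabla f(\mathbf{x}^*)$. By Fact 8.7.1
we have $\mathbf{c}^T(\mathbf{x} - \mathbf{x}^*) \le 0$ for all feasible solutions $\mathbf{x}$, meaning that $\mathbf{x}^*$ is an
optimal solution of the linear program

$$\begin{array}{ll}
\text{maximize} & \mathbf{c}^T \mathbf{x} \\
\text{subject to} & A\mathbf{x} = \mathbf{b} \\
 & \mathbf{x} \ge \mathbf{0}.
\end{array}$$

According to the dualization recipe (Section 6.2), its dual is the linear program

$$\begin{array}{ll}
\text{minimize} & \mathbf{b}^T \mathbf{y} \\
\text{subject to} & A^T \mathbf{y} \ge \mathbf{c},
\end{array}$$

and the duality theorem implies that it has an optimal solution $\mathbf{y}$ satisfying

$$\mathbf{b}^T \mathbf{y} = \mathbf{c}^T \mathbf{x}^*. \tag{8.13}$$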
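
Since $\mathbf{y}$ is a feasible solution of the dual linear program, we have $\mathbf{y}^T \mathbf{a}_j \ge c_j$
for all $j$, and (8.13) implies

$$(\mathbf{y}^T A - \mathbf{c}^T)\,\mathbf{x}^* = \mathbf{b}^T \mathbf{y} - \mathbf{c}^T \mathbf{x}^* = 0.$$

The left-hand side is a sum of the nonnegative terms $(\mathbf{y}^T \mathbf{a}_j - c_j)\, x_j^*$, and so
each of them must be zero.
So we have $\nabla f(\mathbf{x}^*)_j + \mathbf{y}^T \mathbf{a}_j = -c_j + \mathbf{y}^T \mathbf{a}_j \ge 0$, with equality whenever $x_j^* > 0$.
Therefore, we have found the desired multipliers $\mathbf{y}$.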
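For a program with a single equality constraint, the Karush–Kuhn–Tucker test reduces to finding one number $y$. The following Python sketch does this for the hypothetical example of minimizing $f(\mathbf{x}) = (x_1 - 2)^2 + x_2^2$ subject to $x_1 + x_2 = 1$, $\mathbf{x} \ge \mathbf{0}$; the function name and the restriction to positive constraint coefficients are our own simplifying assumptions.

```python
from fractions import Fraction

def kkt_multiplier(grad, x, a):
    """For a program with the single constraint a^T x = b, x >= 0, return a
    multiplier y with grad_j + y*a_j = 0 where x_j > 0 and >= 0 elsewhere,
    or None if no such y exists. Assumes all a_j > 0 and some x_j > 0."""
    lower = None   # y >= lower, forced by the coordinates with x_j = 0
    exact = None   # y == exact, forced by the coordinates with x_j > 0
    for g, xj, aj in zip(grad, x, a):
        bound = Fraction(-g) / aj
        if xj > 0:
            if exact is not None and exact != bound:
                return None        # two positive coordinates disagree
            exact = bound
        else:
            lower = bound if lower is None else max(lower, bound)
    if lower is not None and exact < lower:
        return None
    return exact

def grad_f(x):
    """Gradient of f(x) = (x1 - 2)^2 + x2^2."""
    return (2 * (x[0] - 2), 2 * x[1])

a = (1, 1)                                   # the constraint x1 + x2 = 1
x_opt = (1, 0)
x_mid = (Fraction(1, 2), Fraction(1, 2))
print(kkt_multiplier(grad_f(x_opt), x_opt, a))   # a multiplier exists
print(kkt_multiplier(grad_f(x_mid), x_mid, a))   # no multiplier exists
```

At $(1, 0)$ the multiplier $y = 2$ works, so this point is optimal; at $(\frac12, \frac12)$ the two positive coordinates demand different values of $y$, so that point is not optimal.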


The fact that there is dualization for everyone implies Karush—Kuhn— 
Tucker conditions for everyone. We encourage the reader to work out the 
details, and we mention only one special case here: A feasible solution x* of 
the convex programming problem 


minimize f(x) 
subject to Ax=b 


is optimal if and only if there exists a vector $\mathbf{y}$ such that

$$\nabla f(\mathbf{x}^*) + \mathbf{y}^T A = \mathbf{0}^T.$$


In this special case, the components of y are called Lagrange multipliers 
and can be obtained from x* through Gaussian elimination (also see Sec- 
tion 7.2, where the method of Lagrange multipliers is briefly described in a 
more general setting). If, in addition, f is a quadratic function, its gradient 
is linear, so the minimization problem itself (finding an optimal x* with a 
matching y) can be solved through Gaussian elimination. For example, the 
problem of fitting a line by the method of least squares, mentioned in Sec- 
tion 2.4, is of this easy type, because its bivariate quadratic objective function 
(2.1) is convex. 


Smallest enclosing ball as a convex program. In order to show that 
the smallest enclosing ball of a point set can be extracted from the solution 
of a suitable convex quadratic program, we use the Karush—Kuhn—Tucker 
conditions and the following geometric fact, which is interesting by itself. 


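8.7.3 Lemma. Let $S = \{\mathbf{s}_1, \ldots, \mathbf{s}_k\} \subseteq \mathbb{R}^d$ be a set of points on the boundary
of a ball $B$ with center $\mathbf{s}^* \in \mathbb{R}^d$. Then the following two statements are
equivalent.

(i) $B$ is the unique smallest enclosing ball of $S$.
(ii) For every $\mathbf{u} \in \mathbb{R}^d$, there is an index $j \in \{1, 2, \ldots, k\}$ such that

$$\mathbf{u}^T(\mathbf{s}_j - \mathbf{s}^*) \le 0.$$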


It is a simple exercise to show that (ii) can be reexpressed as follows:
There is no hyperplane that strictly separates $S$ from $\mathbf{s}^*$. From
the Farkas lemma (Section 6.4), one can in turn derive the following
equivalent formulation: The point $\mathbf{s}^*$ is in the convex hull of the
points $S$. We thus have a simple geometric condition that characterizes
smallest enclosing balls in terms of their boundary points. From
the geometric intuition in the plane, this is quite plausible: If $\mathbf{s}^*$ is in
the convex hull of $S$, then $\mathbf{s}^*$ cannot be moved without making the
distance to at least one point larger. But if $\mathbf{s}^*$ is separated from $S$
by a hyperplane, then moving $\mathbf{s}^*$ toward this hyperplane results in an
enclosing ball of smaller radius. The direction $\mathbf{u}$ of movement satisfies
$\mathbf{u}^T(\mathbf{s}_j - \mathbf{s}^*) > 0$ for all $j$.




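Proof. We start by analyzing the distance between a point $\mathbf{s}_j \in S$ and a
potential ball center $\mathbf{s} \neq \mathbf{s}^*$. Let $r$ be the radius of the ball $B$. Given $\mathbf{s} \neq \mathbf{s}^*$,
we can uniquely write it in the form

$$\mathbf{s} = \mathbf{s}^* + t\mathbf{u},$$

where $\mathbf{u}$ is a vector of length 1 and $t > 0$. For $j = 1, 2, \ldots, k$ we get

$$\begin{aligned}
(\mathbf{s}_j - \mathbf{s})^T(\mathbf{s}_j - \mathbf{s}) &= (\mathbf{s}_j - \mathbf{s}^* - t\mathbf{u})^T(\mathbf{s}_j - \mathbf{s}^* - t\mathbf{u}) \\
&= (\mathbf{s}_j - \mathbf{s}^*)^T(\mathbf{s}_j - \mathbf{s}^*) + t^2 \mathbf{u}^T\mathbf{u} - 2t\,\mathbf{u}^T(\mathbf{s}_j - \mathbf{s}^*) \\
&= r^2 + t^2 - 2t\,\mathbf{u}^T(\mathbf{s}_j - \mathbf{s}^*).
\end{aligned}$$

Given $\alpha \in \mathbb{R}$ and sufficiently small $t > 0$, we have $t^2 - 2t\alpha > 0$ if and only if
$\alpha \le 0$. This shows that (for sufficiently small $t$)

$$(\mathbf{s}_j - \mathbf{s})^T(\mathbf{s}_j - \mathbf{s}) > r^2 \quad\Longleftrightarrow\quad \mathbf{u}^T(\mathbf{s}_j - \mathbf{s}^*) \le 0, \tag{8.14}$$

where the implication “$\Leftarrow$” holds for all $t > 0$.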

This equivalence implies the two directions of the lemma. For (i)$\Rightarrow$(ii), we
argue as follows: Since $\mathbf{s}^*$ is the unique point with distance at most $r$ from
all points in $S$, we know that for every $\mathbf{u}$ with $\|\mathbf{u}\| = 1$ and for all $t > 0$,
the point $\mathbf{s} = \mathbf{s}^* + t\mathbf{u}$ has distance more than $r$ to one of the points in $S$.
By the implication “$\Rightarrow$” of (8.14), there is some $j$ with $\mathbf{u}^T(\mathbf{s}_j - \mathbf{s}^*) \le 0$. To
show (ii)$\Rightarrow$(i), let us consider any point $\mathbf{s}$ of the form $\mathbf{s}^* + t\mathbf{u} \neq \mathbf{s}^*$. Since
there is an index $j$ with $\mathbf{u}^T(\mathbf{s}_j - \mathbf{s}^*) \le 0$, implication “$\Leftarrow$” of (8.14) shows
that $\mathbf{s}$ has distance more than $r$ to some point in $S$. It follows that $B$ is the
unique smallest enclosing ball of $S$.


Now we can state and prove the main result. 


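8.7.4 Theorem. Let $\mathbf{p}_1, \ldots, \mathbf{p}_n$ be points in $\mathbb{R}^d$, and let $Q$ be the $d \times n$
matrix whose $j$th column is formed by the $d$ coordinates of the point $\mathbf{p}_j$. Let
us consider the optimization problem

$$\begin{array}{ll}
\text{minimize} & \mathbf{x}^T Q^T Q \mathbf{x} - \sum_{j=1}^n x_j \mathbf{p}_j^T \mathbf{p}_j \\
\text{subject to} & \sum_{j=1}^n x_j = 1 \\
 & \mathbf{x} \ge \mathbf{0}
\end{array} \tag{8.15}$$

in the variables $x_1, \ldots, x_n$. Then the objective function
$f(\mathbf{x}) := \mathbf{x}^T Q^T Q \mathbf{x} - \sum_{j=1}^n x_j \mathbf{p}_j^T \mathbf{p}_j$ is convex, and the following statements hold.

(i) Problem (8.15) has an optimal solution $\mathbf{x}^*$.

(ii) There exists a point $\mathbf{p}^*$ such that $\mathbf{p}^* = Q\mathbf{x}^*$ holds for every optimal
solution $\mathbf{x}^*$. Moreover, the ball with center $\mathbf{p}^*$ and squared radius $-f(\mathbf{x}^*)$
is the unique ball of smallest radius containing $P = \{\mathbf{p}_1, \ldots, \mathbf{p}_n\}$.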


Proof. The matrix $Q^T Q$ is positive semidefinite, and from this the convexity
of $f$ is easy to verify (we leave it as an exercise).

The feasible region of program (8.15) is a compact set (actually, a simplex),
and we are minimizing a continuous function over it. Consequently,
there exists an optimal solution $\mathbf{x}^*$. In order to apply the Karush–Kuhn–Tucker
conditions, we need the gradient of the objective function:


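$$\nabla f(\mathbf{x}) = 2\mathbf{x}^T Q^T Q - (\mathbf{p}_1^T \mathbf{p}_1, \mathbf{p}_2^T \mathbf{p}_2, \ldots, \mathbf{p}_n^T \mathbf{p}_n).$$

The program has only one equality constraint. With $\mathbf{p}^* = Q\mathbf{x}^* = \sum_{j=1}^n x_j^* \mathbf{p}_j$,
Proposition 8.7.2 tells us that we find a 1-dimensional vector $\mathbf{y} = (\mu)$ such
that

$$2\mathbf{p}_j^T \mathbf{p}^* - \mathbf{p}_j^T \mathbf{p}_j + \mu \;\begin{cases} = 0 & \text{if } x_j^* > 0 \\ \ge 0 & \text{otherwise.} \end{cases} \tag{8.16}$$

Subtracting $\mathbf{p}^{*T} \mathbf{p}^*$ from both sides and multiplying by $-1$ yields

$$\|\mathbf{p}_j - \mathbf{p}^*\|^2 \;\begin{cases} = \mu + \mathbf{p}^{*T} \mathbf{p}^* & \text{if } x_j^* > 0 \\ \le \mu + \mathbf{p}^{*T} \mathbf{p}^* & \text{otherwise.} \end{cases}$$

This means that $\mathbf{p}^*$ is the center of a ball of radius $r = \sqrt{\mu + \mathbf{p}^{*T} \mathbf{p}^*}$
that contains all points from $P$ and has the points $\mathbf{p}_j$ with $x_j^* > 0$ on the
boundary. From (8.16) and $\mathbf{x}^* \ge \mathbf{0}$ (together with $\sum_{j=1}^n x_j^* = 1$) we also get

$$\mu = \sum_{j=1}^n x_j^* \mu = \sum_{j=1}^n x_j^* \mathbf{p}_j^T \mathbf{p}_j - 2 \sum_{j=1}^n x_j^* \mathbf{p}_j^T \mathbf{p}^* = \sum_{j=1}^n x_j^* \mathbf{p}_j^T \mathbf{p}_j - 2\mathbf{p}^{*T} \mathbf{p}^*,$$

and $r^2 = \sum_{j=1}^n x_j^* \mathbf{p}_j^T \mathbf{p}_j - \mathbf{p}^{*T} \mathbf{p}^* = -f(\mathbf{x}^*)$ follows.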

It remains to prove that there can be no other ball of radius at most r 
that contains all points from P (this also shows that p* does not depend on 
the choice of x*). 

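We define $F = \{j \in \{1, 2, \ldots, n\} : x_j^* > 0\}$ and apply Lemma 8.7.3 with
$\mathbf{s}^* = \mathbf{p}^*$ and

$$S = \{\mathbf{p}_j : j \in F\}.$$

We already know that these points are on the boundary of a ball $B$ of radius
$r$ around $\mathbf{p}^* = \sum_{j \in F} x_j^* \mathbf{p}_j$. Using $\sum_{j \in F} x_j^* = 1$, we get that the following
holds for all vectors $\mathbf{u}$:

$$\sum_{j \in F} x_j^*\, \mathbf{u}^T(\mathbf{p}_j - \mathbf{p}^*) = \mathbf{u}^T \Big( \sum_{j \in F} x_j^* \mathbf{p}_j - \sum_{j \in F} x_j^* \mathbf{p}^* \Big) = \mathbf{u}^T(\mathbf{p}^* - \mathbf{p}^*) = 0.$$

It follows that there must be some $j \in F$ with $\mathbf{u}^T(\mathbf{p}_j - \mathbf{p}^*) \le 0$. By Lemma
8.7.3, $B$ is the unique smallest enclosing ball of $S \subseteq P$, and this implies that
$B$ is the unique smallest enclosing ball of $P$ as well.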
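The theorem suggests a simple certificate check, sketched below in Python: given a feasible $\mathbf{x}$ for (8.15), compute the center $Q\mathbf{x}$ and the value $-f(\mathbf{x})$, and test condition (8.16). The three sample points and the function names are hypothetical choices for illustration; the sketch does not solve the quadratic program, it only verifies a proposed solution.

```python
from fractions import Fraction

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ball_from_solution(points, x):
    """Center p* = Qx and the value -f(x) for a feasible x of (8.15)."""
    dim = len(points[0])
    center = tuple(sum(xj * p[i] for xj, p in zip(x, points))
                   for i in range(dim))
    f = dot(center, center) - sum(xj * dot(p, p) for xj, p in zip(x, points))
    return center, -f

def is_optimal(points, x):
    """Karush-Kuhn-Tucker test (8.16) for a feasible x with some x_j > 0."""
    center, _ = ball_from_solution(points, x)
    g = [2 * dot(p, center) - dot(p, p) for p in points]   # gradient of f at x
    mus = {-gj for gj, xj in zip(g, x) if xj > 0}          # forced values of mu
    if len(mus) != 1:
        return False
    mu = mus.pop()
    return all(gj + mu >= 0 for gj in g)

half = Fraction(1, 2)
points = [(0, 0), (2, 0), (1, half)]
x_good = (half, half, 0)
x_bad = (0, half, half)
print(ball_from_solution(points, x_good), is_optimal(points, x_good))
print(is_optimal(points, x_bad))
```

For `x_good` the center is $(1, 0)$ with squared radius 1, and the third point lies strictly inside the ball; `x_bad` fails the test because the resulting ball does not contain the origin.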


A recent book on the topics of this section is 


S. Boyd and L. Vandenberghe: Convex Optimization, Cambridge
University Press, Cambridge 2004.


9. Software and Further Reading 


LP-solvers. The most famous (and expensive) software package for solv- 
ing linear programs and integer programs is called CPLEX. Freely available 
codes with similar functionality, although not quite as strong as CPLEX, are 
lp_solve, GLPK, and CLP. The website


www-neos.mcs.anl.gov/neos


contains a guide to many other optimization software systems, and it also 
provides an overview of web solvers, to which one can send an input of an 
optimization problem and, with a bit of luck, be returned an optimum. 

The computational geometry algorithms library CGAL (www.cgal.org)
contains software for solving linear and convex quadratic programs using
exact rational arithmetic. We refer to the website of this book
(http://www.inf.ethz.ch/personal/gaertner/lpbook) for further information.

Books. The web bookstore Amazon offers more books with “linear pro- 
gramming” in the title than with “astrology,” and so it is clear that we can 
mention only a very narrow selection from the literature. 

A reasonably recent, accessible, and quite comprehensive (but not exactly 
cheap) textbook of linear programming is 


D. Bertsimas and J. Tsitsiklis: Introduction to Linear Optimization, 
Athena Scientific, Belmont, Massachusetts, 1997. 


Both linear and integer programming are treated on an excellent theoretical 
level in 


A. Schrijver: Theory of Linear and Integer Programming, Wiley- 
Interscience, New York 1986. 


The book 
V. Chvátal: Linear Programming, W. H. Freeman, New York 1983,


was considered one of the best textbooks in its time and it is still used widely. 
And those liking classical sources may appreciate 


G. B. Dantzig: Linear Programming and Extensions, Princeton Uni- 
versity Press, Princeton 1963. 


Appendix: Linear Algebra 


Here we summarize facts and concepts from linear algebra used in this book. 
This part is not meant to be a textbook introduction to the subject. It is 
mainly intended for the reader who has some knowledge of the area but may 
have forgotten the exact definitions or may know them in a slightly different 
form. The number of introductory textbooks is vast; in order to cite at least 
one, we mention 


G. Strang: Introduction to Linear Algebra, 3rd edition, Wellesley- 
Cambridge Press, Wellesley, Massachusetts, 2003. 


Vectors. We work exclusively with vectors in $\mathbb{R}^n$, and so for us, a vector
is an ordered $n$-tuple $\mathbf{v} = (v_1, v_2, \ldots, v_n) \in \mathbb{R}^n$ of real numbers. We denote
vectors by boldface letters, and $v_i$ denotes the $i$th component of $\mathbf{v}$. For $\mathbf{u}, \mathbf{v} \in
\mathbb{R}^n$ we define the sum componentwise:

$$\mathbf{u} + \mathbf{v} = (u_1 + v_1, u_2 + v_2, \ldots, u_n + v_n).$$

The multiplication of $\mathbf{v} \in \mathbb{R}^n$ by a real number $t$ is also given componentwise,
by

$$t\mathbf{v} = (tv_1, tv_2, \ldots, tv_n).$$

We use the notation $\mathbf{0}$ for the zero vector, with all components 0, and $\mathbf{1}$
denotes a vector with all components equal to 1.

A linear subspace (or a vector subspace) of $\mathbb{R}^n$ is a set $V \subseteq \mathbb{R}^n$ that
contains $\mathbf{0}$ and is closed under addition and multiplication by a real number;
that is, if $\mathbf{u}, \mathbf{v} \in V$ and $t \in \mathbb{R}$, we have $\mathbf{u} + \mathbf{v} \in V$ and $t\mathbf{v} \in V$. For example,
the linear subspaces of $\mathbb{R}^3$ are $\{\mathbf{0}\}$, lines passing through $\mathbf{0}$, planes passing
through $\mathbf{0}$, and $\mathbb{R}^3$ itself. An affine subspace is any set of the form $\mathbf{u} + V =
\{\mathbf{u} + \mathbf{v} : \mathbf{v} \in V\} \subseteq \mathbb{R}^n$, where $V$ is a linear subspace of $\mathbb{R}^n$ and $\mathbf{u} \in \mathbb{R}^n$.
Geometrically, it is a linear subspace translated by some fixed vector. The
affine subspaces of $\mathbb{R}^3$ are all the one-point subsets, all the lines, all the planes,
and $\mathbb{R}^3$.


A linear combination of vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_m \in \mathbb{R}^n$ is any vector
of the form $t_1\mathbf{v}_1 + t_2\mathbf{v}_2 + \cdots + t_m\mathbf{v}_m$, where $t_1, \ldots, t_m$ are real
numbers. Vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_m \in \mathbb{R}^n$ are linearly independent if the
only linear combination of $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_m$ equal to $\mathbf{0}$ has
$t_1 = t_2 = \cdots = t_m = 0$. Equivalently, linear independence means that
none of the $\mathbf{v}_i$ can be expressed as a linear combination of the
remaining ones.

The linear span of a set $X \subseteq \mathbb{R}^n$ is the smallest (with respect to
inclusion) linear subspace of $\mathbb{R}^n$ that contains $X$. Explicitly, it is the
set of all linear combinations of finitely many vectors from $X$. The linear
span of any set, even the empty one, always contains $\mathbf{0}$, which is formally
considered as a linear combination of the empty set of vectors.

A basis of a linear subspace $V \subseteq \mathbb{R}^n$ is a linearly independent set of
vectors from $V$ whose linear span is $V$. The standard basis of $\mathbb{R}^n$
consists of the vectors $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_n$, where $\mathbf{e}_i$ is the vector with 1
at the $i$th position and 0's elsewhere.

All bases of a given linear subspace $V$ have the same number of vectors,
and this number $\dim(V)$ is the dimension of $V$. In particular, each basis
of $\mathbb{R}^n$ has $n$ vectors.

Matrices. A matrix is a rectangular table of numbers (real numbers, in our
case). An $m \times n$ matrix has $m$ rows and $n$ columns. If a matrix is called
$A$, then its entry in the $i$th row and $j$th column is usually denoted by
$a_{ij}$. So, for example, a $3 \times 4$ matrix $A$ has the general form

$$\begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \end{pmatrix}.$$


A matrix is denoted by writing large parentheses to enclose the table of 
elements. A submatrix of a matrix A is any matrix that can be obtained 
from A by deleting some rows and some columns (including A itself, where 
we delete nothing). 

A matrix is multiplied by a number $t$ by multiplying each entry by $t$.
Two $m \times n$ matrices $A$ and $B$ are added by adding the corresponding
entries. That is, if we set $C = A + B$, we have $c_{ij} = a_{ij} + b_{ij}$ for
$i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$.

Matrix multiplication is more complicated. A product $AB$, where $A$
and $B$ are matrices, is defined only if the number of columns of $A$ is the
same as the number of rows of $B$. If $A$ is an $m \times n$ matrix and $B$ is an
$n \times p$ matrix, then the product $C = AB$ is an $m \times p$ matrix given by

$$c_{ij} = a_{i1}b_{1j} + a_{i2}b_{2j} + \cdots + a_{in}b_{nj}.$$


Pictorially,

[Figure: the entry $c_{ij}$ of the $m \times p$ matrix $C = AB$ is obtained from row $i$
of the $m \times n$ matrix $A$ and column $j$ of the $n \times p$ matrix $B$.]

Matrix multiplication is associative, meaning that $A(BC) = (AB)C$
whenever at least one of the two sides is defined, and this is quite often
used in proofs. We also recall the well-known fact that in general, matrix
multiplication is not commutative; i.e., typically $AB \ne BA$.
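
The entry formula for the product can be checked on small examples. The following is a minimal sketch in plain Python; the helper name `mat_mul` and the sample matrices are illustrative assumptions, not part of the text. It represents a matrix as a list of rows and confirms associativity and non-commutativity on one instance.

```python
def mat_mul(A, B):
    """Product C = AB of an m x n matrix A and an n x p matrix B (lists of rows)."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("number of columns of A must equal number of rows of B")
    p = len(B[0])
    # c_ij = a_i1*b_1j + a_i2*b_2j + ... + a_in*b_nj
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [0, 3]]

print(mat_mul(A, B))  # [[2, 1], [4, 3]]
print(mat_mul(B, A))  # [[3, 4], [1, 2]]
print(mat_mul(A, mat_mul(B, C)) == mat_mul(mat_mul(A, B), C))  # True
```

Here multiplying by `B` from the right swaps the columns of `A`, while multiplying from the left swaps its rows, which is why the two products differ.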

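We also multiply matrices and vectors. In such context, a vector
$\mathbf{x} \in \mathbb{R}^n$ is usually considered as an $n \times 1$ matrix; thus, in the matrix
form, a vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ should be written as a column:

$$\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}.$$

So if $A$ is an $m \times n$ matrix and $\mathbf{x} \in \mathbb{R}^n$ is a vector, then the product
$A\mathbf{x}$ is a vector in $\mathbb{R}^m$.

This is used in the matrix notation $A\mathbf{x} = \mathbf{b}$ for a system of linear
equations. Here $A$ is a given $m \times n$ matrix, $\mathbf{b} \in \mathbb{R}^m$ is a given vector,
and $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ is a vector of $n$ unknowns. So $A\mathbf{x} = \mathbf{b}$ is a
shorthand for the system of $m$ equations

$$\begin{array}{ccccccccc}
a_{11}x_1 &+& a_{12}x_2 &+& \cdots &+& a_{1n}x_n &=& b_1 \\
a_{21}x_1 &+& a_{22}x_2 &+& \cdots &+& a_{2n}x_n &=& b_2 \\
 & & & & \vdots & & & & \\
a_{m1}x_1 &+& a_{m2}x_2 &+& \cdots &+& a_{mn}x_n &=& b_m.
\end{array}$$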


If $A$ is an $m \times n$ matrix, then $A^T$ denotes the $n \times m$ matrix having the
element $a_{ji}$ in the $i$th row and $j$th column. The matrix $A^T$ is called the
transpose of the matrix $A$. For transposing the matrix product, we have
the formula $(AB)^T = B^T A^T$.

A square matrix is an $n \times n$ matrix, i.e., one with the same number of
rows and columns. A diagonal matrix is a square matrix $D$ with $d_{ij} = 0$
for all $i \ne j$; that is, it may have nonzero elements only on the diagonal.
The $n \times n$ identity matrix $I_n$ has 1's on the diagonal and 0's elsewhere.
For any $m \times n$ matrix $A$ we have $I_m A = A I_n = A$.


Rank, inverse, and linear systems. Each row of an $m \times n$ matrix $A$
can also be regarded as a vector in $\mathbb{R}^n$. The linear span of all rows of $A$
is a subspace of $\mathbb{R}^n$ called the row space of $A$, and similarly, we define
the column space of $A$ as the linear subspace of $\mathbb{R}^m$ spanned by the
columns of $A$. An important result of linear algebra tells us that for every
matrix $A$ the row space and the column space have the same dimension, and
this dimension is called the rank of $A$. In particular, an $m \times n$ matrix $A$
has rank $m$ if and only if the rows of $A$ are linearly independent (which
can happen only if $m \le n$). An $n \times n$ matrix is called nonsingular if it has
rank $n$; otherwise, it is singular.

Let $A$ be a square matrix. A matrix $B$ is called an inverse of $A$ if
$AB = I_n$. An inverse to $A$ exists if and only if $A$ is nonsingular. In this
case it is determined uniquely, it is denoted by $A^{-1}$, and it is inverse from
both sides; that is, $AA^{-1} = A^{-1}A = I_n$. For the inverse of a product we
have $(AB)^{-1} = B^{-1}A^{-1}$.

Let us again consider a system $A\mathbf{x} = \mathbf{b}$ of $m$ linear equations with $n$
unknowns. For $\mathbf{b} = \mathbf{0}$, the set of all solutions is easily seen to be a linear
subspace of $\mathbb{R}^n$. Its dimension equals $n$ minus the rank of $A$. In
particular, for $m = n$ (i.e., $A$ is a square matrix), the system $A\mathbf{x} = \mathbf{0}$ has
$\mathbf{x} = \mathbf{0}$ as the only solution if and only if $A$ is nonsingular.

For $\mathbf{b} \ne \mathbf{0}$, the system $A\mathbf{x} = \mathbf{b}$ may or may not have a solution. If it
has at least one solution, then the solution set is an affine subspace of
$\mathbb{R}^n$, again of dimension $n$ minus the rank of $A$. If $m = n$ and $A$ is
nonsingular, then $A\mathbf{x} = \mathbf{b}$ always has exactly one solution.


Row operations and Gaussian elimination. By an elementary row 
operation we mean one of the following two operations on a given matrix A: 


(a) Multiplying all entries in some row of A by a nonzero real number t. 
(b) Replacing the $i$th row of $A$ by the sum of the $i$th row and $j$th row for
some $i \ne j$.


Gaussian elimination is a systematic procedure that, given an input ma- 
trix A, performs a sequence of elementary row operations on it so that it is 
converted to a row echelon form. This means that the resulting matrix looks 
as in the following picture: 


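[Figure: schematic of an $m \times n$ matrix in row echelon form; the nonzero
rows 1 through $r$ form a staircase, and the remaining rows are zero.]

(in the picture, the dots denote nonzero elements and the white region
contains only 0's). In words, there exists an integer $r$ such that the rows
1 through $r$ are nonzero, the remaining rows are all zero, and if
$j(i) = \min\{j : a_{ij} \ne 0\}$, then $j(1) < j(2) < \cdots < j(r)$.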

The rank of a matrix in this form is clearly r, and since elementary row 
operations preserve rank, this procedure can be used to compute the rank. 
It is also commonly used for finding all solutions to the system Ax = b: 
In this case, the matrix (A|b) (the matrix A with b appended as the last 
column) is converted to row echelon form, and from this, the solution set 
can be computed easily. A variant of Gaussian elimination can also be used
to compute the inverse matrix, essentially by solving the $n$ linear systems
$A\mathbf{x} = \mathbf{e}_i$, $i = 1, 2, \ldots, n$.
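
As an illustration of how the rank can be read off a row echelon form, here is a minimal sketch in Python using exact rational arithmetic. The function name `row_echelon` is an assumption made for this example. It swaps rows and subtracts multiples of rows; both can be composed from the elementary operations (a) and (b), so the rank is preserved.

```python
from fractions import Fraction


def row_echelon(A):
    """Return (R, r): a row echelon form R of the matrix A and its rank r."""
    R = [[Fraction(x) for x in row] for row in A]
    m = len(R)
    n = len(R[0]) if m else 0
    r = 0  # number of pivot rows found so far
    for col in range(n):
        # look for a row at or below position r with a nonzero entry in this column
        pivot = next((i for i in range(r, m) if R[i][col] != 0), None)
        if pivot is None:
            continue
        R[r], R[pivot] = R[pivot], R[r]
        # clear the entries below the pivot
        for i in range(r + 1, m):
            factor = R[i][col] / R[r][col]
            R[i] = [a - factor * b for a, b in zip(R[i], R[r])]
        r += 1
        if r == m:
            break
    return R, r


A = [[1, 2, 3],
     [2, 4, 6],
     [1, 0, 1]]
R, rank = row_echelon(A)
print(rank)  # 2
```

The second row of `A` is twice the first, so only two nonzero rows survive the elimination.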

Determinants. Every square matrix $A$ is assigned a number $\det(A)$ called
the determinant of $A$. The determinant of $A$ is defined by the formula

$$\det(A) = \sum_{\pi \in S_n} \operatorname{sgn}(\pi) \prod_{i=1}^{n} a_{i,\pi(i)},$$

where the sum is over all permutations $\pi$ of the set $\{1, 2, \ldots, n\}$ and where
$\operatorname{sgn}(\pi)$ denotes the sign of a permutation $\pi$. The sign of any permutation
is either $+1$ or $-1$, and it can be compactly defined as the sign of the
expression

$$\prod_{1 \le i < j \le n} (\pi(j) - \pi(i)).$$

For example, the determinant of a $2 \times 2$ matrix $A$ equals
$a_{11}a_{22} - a_{12}a_{21}$.

We have $\det(A) \ne 0$ if and only if $A$ is nonsingular. Cramer's rule is a
formula describing the (unique) solution of the linear system $A\mathbf{x} = \mathbf{b}$
with $A$ square and nonsingular. It asserts that

$$x_j = \frac{\det(A_{j \to \mathbf{b}})}{\det(A)},$$

where $A_{j \to \mathbf{b}}$ denotes the matrix $A$ with the $j$th column replaced by the
vector $\mathbf{b}$.

For any column index $j$, we have the following formula (the Laplace
expansion of the determinant according to a column):

$$\det(A) = \sum_{i=1}^{n} (-1)^{i+j} a_{ij} \det(A^{ij}),$$

where $A^{ij}$ denotes the matrix arising from $A$ by deleting the $i$th row and
the $j$th column.
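
The permutation formula and Cramer's rule can be tried directly on small matrices. The sketch below is only an illustration: the names `sign`, `det`, and `cramer` are assumptions of this example, and summing over all $n!$ permutations is practical only for very small $n$.

```python
from fractions import Fraction
from itertools import permutations


def sign(perm):
    """Sign of a permutation of 0..n-1, i.e. the sign of prod_{i<j} (perm[j] - perm[i])."""
    s = 1
    for i in range(len(perm)):
        for j in range(i + 1, len(perm)):
            if perm[j] < perm[i]:
                s = -s
    return s


def det(A):
    """Determinant of a square matrix (list of rows) via the permutation formula."""
    n = len(A)
    total = Fraction(0)
    for perm in permutations(range(n)):
        term = Fraction(sign(perm))
        for i in range(n):
            term *= A[i][perm[i]]
        total += term
    return total


def cramer(A, b):
    """Solve Ax = b for nonsingular A: x_j = det(A with column j replaced by b) / det(A)."""
    d = det(A)
    if d == 0:
        raise ValueError("A is singular")
    x = []
    for j in range(len(A)):
        Aj = [row[:j] + [b[i]] + row[j + 1:] for i, row in enumerate(A)]
        x.append(det(Aj) / d)
    return x


A = [[2, 1], [1, 3]]
b = [3, 5]
print(det(A))        # 5
print(cramer(A, b))  # [Fraction(4, 5), Fraction(7, 5)]
```

Substituting back, $2 \cdot \tfrac45 + \tfrac75 = 3$ and $\tfrac45 + 3 \cdot \tfrac75 = 5$, as required.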


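Scalar product, Euclidean norm, orthogonality. The (standard)
scalar product of two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ is the number
$x_1y_1 + x_2y_2 + \cdots + x_ny_n$. We often write the scalar product as $\mathbf{x}^T\mathbf{y}$,
although formally, $\mathbf{x}^T\mathbf{y}$ is a $1 \times 1$ matrix whose single entry is the scalar
product. The Euclidean norm of a vector $\mathbf{x} \in \mathbb{R}^n$ is denoted by $\|\mathbf{x}\|$
and defined by $\|\mathbf{x}\| = \sqrt{\mathbf{x}^T\mathbf{x}} = \sqrt{x_1^2 + \cdots + x_n^2}$. The Euclidean distance
of two vectors $\mathbf{x}$ and $\mathbf{y}$ is $\|\mathbf{x} - \mathbf{y}\|$.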

Two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ are called orthogonal if $\mathbf{x}^T\mathbf{y} = 0$. More
generally, the angle of nonzero vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ is defined as the angle
between 0 and $\pi$ whose cosine equals

$$\frac{\mathbf{x}^T\mathbf{y}}{\|\mathbf{x}\| \cdot \|\mathbf{y}\|}.$$

A square matrix $A$ is called orthogonal if each column has Euclidean
norm 1 and every two distinct columns are orthogonal. Equivalently, we have
$A^{-1} = A^T$. From this one can also see that $A$ is orthogonal if and only if
$A^T$ is.

The orthogonal complement of a set $X \subseteq \mathbb{R}^n$ is the set
$X^{\perp} = \{\mathbf{y} \in \mathbb{R}^n : \mathbf{x}^T\mathbf{y} = 0 \text{ for all } \mathbf{x} \in X\}$. It is always a linear subspace
of $\mathbb{R}^n$. If $V$ is a linear subspace of $\mathbb{R}^n$, then $\dim(V) + \dim(V^{\perp}) = n$.

If $V$ is a linear subspace of $\mathbb{R}^n$, the orthogonal projection on $V$ is a
mapping $\mathbb{R}^n \to V$ that assigns to each $\mathbf{x} \in \mathbb{R}^n$ a vector $\mathbf{y} \in V$ such that
$\mathbf{x} - \mathbf{y}$ is orthogonal to all vectors in $V$. It can be shown that for every $\mathbf{x}$
such a $\mathbf{y}$ is unique and it is also the point of $V$ that minimizes the
Euclidean distance $\|\mathbf{x} - \mathbf{y}\|$ among all points of $V$.


Glossary or: What Was Neglected 


The theory of linear programming is huge, and many interesting things were 
not addressed in this book. In the glossary we briefly explain some of the 
common terms found elsewhere. This should help the reader in reading more 
advanced sources, say research papers, or in a conversation at a linear pro- 
gramming banquet. Our coverage is by no means complete, and to some 
extent, it is also guided by personal taste. 


Bounds for the variables. Linear programs occurring in practice often
prescribe upper and lower bounds for (some of) the variables. For
simplicity, let us assume that we are dealing with a linear program of the
form

    maximize   $\mathbf{c}^T\mathbf{x}$
    subject to $A\mathbf{x} = \mathbf{b}$
               $\mathbf{0} \le \mathbf{x} \le \mathbf{u}$,

where $u_j \in \mathbb{R} \cup \{\infty\}$ for $j = 1, 2, \ldots, n$.
We have seen in Section 4.1 how to convert this linear program into equa- 
tional form, and after this conversion, we can solve it using the simplex 
method. The problem here is that the conversion generates many new 
variables and constraints, making the problem substantially larger and 
the computations within a single pivot step more expensive. 
A better way is to handle the constraints x < u implicitly in the simplex 
method. This is easy: During the selection of the leaving variable we 
also have to consider the possibility that the entering variable reaches its 
upper bound before any basic variable reaches one of its bounds. In this 
situation, the entering variable is simply set to its upper bound, and the 
basis B does not change at all. In general, every nonbasic variable attains 
one of its bounds, and “entering it” means to let its value change in the 
direction of the other bound. As before, if $x_{\ell_j} = 0$ in the current basic
feasible solution, then $x_{\ell_j}$ is a candidate for the entering variable if and
only if the corresponding coefficient $r_j$ is positive. On the other hand, if
$x_{\ell_j} = u_{\ell_j}$, then $x_{\ell_j}$ is a candidate if and only if $r_j < 0$.
This scheme easily generalizes to bounds of the form $\boldsymbol{\ell} \le \mathbf{x} \le \mathbf{u}$.

Branch and bound is a general method for solving optimization problems
by a clever enumeration of possible solutions. Here we describe how it
works for integer programs (Chapter 3). We suppose that the integer
program is

    maximize $\mathbf{c}^T\mathbf{x}$ subject to $A\mathbf{x} \le \mathbf{b}$, $\mathbf{x} \in \mathbb{Z}^n$,   (ILP)

and, for simplicity, that the polyhedron $P = \{\mathbf{x} \in \mathbb{R}^n : A\mathbf{x} \le \mathbf{b}\}$ is
bounded. As a first step, we solve the LP relaxation obtained by dropping
the integrality constraints $\mathbf{x} \in \mathbb{Z}^n$. If the LP relaxation is infeasible, then
(ILP) is infeasible as well and we stop. Otherwise, the LP relaxation has
an optimal solution $\mathbf{x}^*$. If $\mathbf{x}^* \in \mathbb{Z}^n$, we have already solved the program
(ILP), and if $\mathbf{x}^* \notin \mathbb{Z}^n$, we choose some nonintegral component $x_j^*$ and
“split” the integer program into two integer programs: The first one,
$(\mathrm{ILP}_{\le})$, is obtained from (ILP) by adding the constraint $x_j \le \lfloor x_j^* \rfloor$,
while the second one, $(\mathrm{ILP}_{\ge})$, arises from (ILP) by adding the constraint
$x_j \ge \lfloor x_j^* \rfloor + 1$. This is the branching step.

Since every feasible solution of (ILP) is feasible for exactly one of the
programs $(\mathrm{ILP}_{\le})$ and $(\mathrm{ILP}_{\ge})$, we have only to solve the latter two
integer programs to find an optimal solution of (ILP), if it exists. The
“only” refers to the fact that both of these programs have strictly smaller
feasible regions than the original one. To solve $(\mathrm{ILP}_{\le})$, say, we proceed
in the same fashion (solve the LP relaxation, split into two subproblems if
necessary).

In this way, we explore an implicitly given binary tree whose nodes cor- 
respond to subproblems of the original integer program. The exploration 
stops at a node when the LP relaxation becomes infeasible or has an in- 
tegral optimal solution. From the assumption that P is bounded it is not 
hard to show that this eventually happens along every exploration path. 
Therefore, the process terminates and computes an optimal solution of 
the “root program” (ILP), if there is one. 

So far this approach is missing the advertised cleverness, but here it 
comes: Whenever an integral solution of a subproblem has been discov- 
ered, its value (of the objective function) is a lower bound for the optimal 
value of (ILP). During exploration, we maintain the highest such lower 
bound z*. If at some subsequent node the optimal solution of the LP 
relaxation has objective function value at most z*, we can conclude that 
the subtree below that node need not be explored: No integral solution 
obtained from it can beat our current best solution with value z*. This 
is the bounding step. 
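
The branching and bounding steps can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the function names are made up for this example, constraints are given as pairs `(a, b)` meaning $\mathbf{a}^T\mathbf{x} \le b$, the polyhedron is assumed bounded, and the LP relaxation is solved by brute-force enumeration of vertices in exact arithmetic rather than by the simplex method, which is feasible only for tiny instances.

```python
from fractions import Fraction
from itertools import combinations
from math import floor


def solve(M, rhs):
    """Solve the square system M x = rhs exactly; return None if M is singular."""
    n = len(M)
    T = [[Fraction(v) for v in row] + [Fraction(rhs[i])] for i, row in enumerate(M)]
    for col in range(n):
        piv = next((i for i in range(col, n) if T[i][col] != 0), None)
        if piv is None:
            return None
        T[col], T[piv] = T[piv], T[col]
        d = T[col][col]
        T[col] = [v / d for v in T[col]]
        for i in range(n):
            if i != col:
                f = T[i][col]
                T[i] = [a - f * b for a, b in zip(T[i], T[col])]
    return [T[i][n] for i in range(n)]


def lp_relaxation(c, cons):
    """Maximize c.x subject to a.x <= b for all (a, b) in cons, by trying all vertices.
    Returns (value, x), or None if infeasible. Assumes a bounded feasible region."""
    best = None
    for subset in combinations(cons, len(c)):
        x = solve([a for a, _ in subset], [b for _, b in subset])
        if x is None:
            continue
        if all(sum(ai * xi for ai, xi in zip(a, x)) <= b for a, b in cons):
            val = sum(ci * xi for ci, xi in zip(c, x))
            if best is None or val > best[0]:
                best = (val, x)
    return best


def branch_and_bound(c, cons):
    best = None      # (z*, x): best integral solution found so far
    stack = [cons]   # subproblems still to be explored (depth-first order)
    while stack:
        node = stack.pop()
        relax = lp_relaxation(c, node)
        if relax is None:
            continue                      # LP relaxation infeasible
        val, x = relax
        if best is not None and val <= best[0]:
            continue                      # bounding step: cannot beat z*
        j = next((i for i, xi in enumerate(x) if xi.denominator != 1), None)
        if j is None:
            best = (val, x)               # integral optimum of this subproblem
            continue
        e = [1 if i == j else 0 for i in range(len(c))]
        fl = floor(x[j])
        stack.append(node + [(e, fl)])                        # x_j <= floor(x_j*)
        stack.append(node + [([-v for v in e], -(fl + 1))])   # x_j >= floor(x_j*) + 1
    return best


# maximize 5x1 + 4x2 subject to x1 + x2 <= 5, 10x1 + 6x2 <= 45, x1, x2 >= 0
cons = [([1, 1], 5), ([10, 6], 45), ([-1, 0], 0), ([0, -1], 0)]
value, x = branch_and_bound([5, 4], cons)
print(value, x)
```

On this instance the LP relaxation has the fractional optimum $(15/4, 5/4)$ with value $95/4$, while the search returns the integral optimum $(3, 2)$ with value 23.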

In the worst case the bounding step may not prune any subtree, but 
in many practical applications it results in enormous savings and allows 
for solving large integer programs. The effectiveness of branch and bound
also depends on the choice of the nonintegral components $x_j^*$ at the nodes,
and on the order in which nodes are explored. More generally, the
“axis-parallel” split according to the constraints $x_j \le \lfloor x_j^* \rfloor$ and
$x_j \ge \lfloor x_j^* \rfloor + 1$ may be replaced by a split along an arbitrary direction
(branching on hyperplanes).

Branch and cut. This method combines the branch and bound technique
with cutting planes. In many cases, branch and bound does not work 
well since the plain LP relaxation does not provide a useful upper bound 
for the optimal solution of an integer program (we have seen a simple 
example for that in Section 3.4). Before branching, one therefore tries to 
get a better upper bound by adding cutting planes to the LP relaxation. 
Sometimes, these are tailor-made for the problem at hand and cut off 
many more fractional solutions than general-purpose cutting planes like 
Gomory cuts. The branching starts only after all cutting planes from 
a predetermined set of candidate inequalities have been added. In the 
subproblems one proceeds similarly. 

Chvátal rank. We consider a polyhedron $P \subseteq \mathbb{R}^n$. If $\mathbf{a}^T\mathbf{x} \le b$, with
$\mathbf{a} \in \mathbb{Z}^n$ and $b \in \mathbb{Q}$, is some inequality satisfied by all $\mathbf{x} \in P$, then the
inequality

$$\mathbf{a}^T\mathbf{x} \le \lfloor b \rfloor$$

is satisfied by all integral points in $P$. This inequality is a
Chvátal-Gomory cut of $P$.

Let us define $P'$ as the set of points that satisfy all Chvátal-Gomory cuts
of $P$. It follows that $P' \supseteq P_I$, where $P_I$ is the convex hull of all integral
points in $P$, the integer hull of $P$.

Let us now assume that $P$ is a rational polyhedron, i.e., one that can
be described by an inequality system $A\mathbf{x} \le \mathbf{b}$ with all components of $A$
and $\mathbf{b}$ rational. In this case, one can show that $P'$ is again a polyhedron.
Moreover, it is easy to see that $P' \subseteq P$. We thus have

$$P = P^{(0)} \supseteq P^{(1)} \supseteq P^{(2)} \supseteq \cdots \supseteq P_I,$$

where $P^{(k)} = (P^{(k-1)})'$ for $k > 0$.

It is known that there is a finite number $t$ such that $P^{(t)} = P_I$; the
smallest such $t$ is the Chvátal rank of the rational polyhedron $P$. It is
a measure of “nonintegrality” of $P$. Such a number $t$ even exists if $P$ is
nonrational but bounded.
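
One common way to obtain an inequality $\mathbf{a}^T\mathbf{x} \le b$ that is valid for $P = \{\mathbf{x} : A\mathbf{x} \le \mathbf{b}\}$ is to add up the rows of the system with nonnegative multipliers. The sketch below assumes this setting, and the function name `chvatal_gomory_cut` is made up for the example. It rounds down the right-hand side and checks the resulting cut on a small triangle-type system.

```python
from fractions import Fraction
from itertools import product
from math import floor


def chvatal_gomory_cut(A, b, u):
    """Combine the rows of Ax <= b with multipliers u >= 0 and round down.
    Returns (a, floor(u.b)) describing the cut a.x <= floor(u.b)."""
    if any(ui < 0 for ui in u):
        raise ValueError("multipliers must be nonnegative")
    n = len(A[0])
    a = [sum(Fraction(ui) * row[j] for ui, row in zip(u, A)) for j in range(n)]
    if any(aj.denominator != 1 for aj in a):
        raise ValueError("the combined coefficient vector must be integral")
    rhs = sum(Fraction(ui) * bi for ui, bi in zip(u, b))
    return [int(aj) for aj in a], floor(rhs)


# x1 + x2 <= 1, x2 + x3 <= 1, x1 + x3 <= 1 (together with x >= 0)
A = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
b = [1, 1, 1]
a, rhs = chvatal_gomory_cut(A, b, [Fraction(1, 2)] * 3)
print(a, rhs)  # [1, 1, 1] 1

# every 0/1 point satisfying the system satisfies the cut ...
for x in product([0, 1], repeat=3):
    if all(sum(r * xi for r, xi in zip(row, x)) <= bi for row, bi in zip(A, b)):
        assert sum(x) <= rhs
# ... while the fractional point (1/2, 1/2, 1/2) of P violates it
print(3 * Fraction(1, 2) <= rhs)  # False
```

The cut $x_1 + x_2 + x_3 \le 1$ removes the fractional point $(1/2, 1/2, 1/2)$ without losing any integral point.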

Column generation. We have pointed out in Section 7.1 that even linear 
programs with a very large (possibly infinite) number of constraints can 
efficiently be solved, by the ellipsoid method or interior point methods, 
if a separation oracle is available. Given a candidate solution s, such an 
oracle either certifies that s is feasible, or it returns a violated constraint. 
Let us consider the dual scenario—a linear program with a very large 
number of variables. Even if this linear program is too large to be ex- 
plicitly stored, we may be able to solve it using the simplex method. 
The crucial observation is that in every pivot step, the simplex method 
needs just one entering variable (and the tableau column associated with 
it) to proceed; see Section 5.6. If we have an improvement oracle that 
returns such a variable (or certifies that the current basic feasible
solution is optimal), we can still use the simplex method to solve the linear
program. This method is called (delayed) column generation. It can be
used in applications as in Section 2.7 (paper cutting), for example, where 
the number of variables can quickly become very large. In fact, since an 
improvement oracle can be interpreted as a separation oracle for the dual 
linear program, we can also use the ellipsoid method to solve the linear 
program, using only a polynomial number of calls to the improvement 
oracle. 


Complementary slackness. The following corollary of the duality
theorem is known as the theorem of complementary slackness.

    Let $\mathbf{x}^* = (x_1^*, \ldots, x_n^*)$ be a feasible solution of the linear
    program

        maximize $\mathbf{c}^T\mathbf{x}$ subject to $A\mathbf{x} \le \mathbf{b}$ and $\mathbf{x} \ge \mathbf{0}$,   (P)

    and let $\mathbf{y}^* = (y_1^*, \ldots, y_m^*)$ be a feasible solution of the dual
    linear program

        minimize $\mathbf{b}^T\mathbf{y}$ subject to $A^T\mathbf{y} \ge \mathbf{c}$ and $\mathbf{y} \ge \mathbf{0}$.   (D)

    Then the following two statements are equivalent:

    (i) $\mathbf{x}^*$ is optimal for (P) and $\mathbf{y}^*$ is optimal for (D).

    (ii) For all $i = 1, 2, \ldots, m$, $\mathbf{x}^*$ satisfies the $i$th constraint of (P)
    with equality or $y_i^* = 0$; similarly, for all $j = 1, 2, \ldots, n$, $\mathbf{y}^*$
    satisfies the $j$th constraint of (D) with equality or $x_j^* = 0$.

In words, statement (ii) means the following: If we pair up each (primal 
or dual) nonnegativity constraint with its corresponding (dual or primal) 
inequality, then at least one of the constraints in each pair is satisfied 
with equality (“has no slack”) at x* or y*. 

Complementary slackness is often encountered as a “combinatorial” proof
of optimality, as opposed to the “numerical” proof obtained by comparing
the values of the objective functions. We have come across complementary
slackness at various places, without calling it so: in the physical
interpretation of duality (Section 6.2), in connection with the primal-dual
central path (Section 7.2), and in the Karush-Kuhn-Tucker conditions
(Section 8.7).
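
Condition (ii) is easy to test mechanically for a given pair of feasible solutions. The following sketch does so in exact arithmetic; the function name and the small example program are assumptions chosen for illustration.

```python
from fractions import Fraction as F


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def complementary_slackness(A, b, c, x, y):
    """Check condition (ii) for a feasible x of (P) and a feasible y of (D)."""
    m, n = len(A), len(c)
    # i-th primal constraint tight, or y_i = 0
    rows_ok = all(dot(A[i], x) == b[i] or y[i] == 0 for i in range(m))
    # j-th dual constraint tight, or x_j = 0
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    cols_ok = all(dot(cols[j], y) == c[j] or x[j] == 0 for j in range(n))
    return rows_ok and cols_ok


# maximize x1 + x2 subject to x1 + 2x2 <= 4, 3x1 + x2 <= 6, x >= 0
A = [[1, 2], [3, 1]]
b = [4, 6]
c = [1, 1]

x_opt, y_opt = [F(8, 5), F(6, 5)], [F(2, 5), F(1, 5)]
print(complementary_slackness(A, b, c, x_opt, y_opt))   # True
print(dot(c, x_opt) == dot(b, y_opt))                   # True: equal objective values

# a feasible but non-optimal pair violates condition (ii)
print(complementary_slackness(A, b, c, [0, 0], [1, 1]))  # False
```

For the optimal pair, both objective values equal $14/5$, matching the “numerical” proof of optimality.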


Criss-cross method. This is yet another method for solving linear
programs in equational form. Like the simplex method and the dual simplex
method, it goes through a sequence of simplex tableaus

$$\begin{array}{rcl} \mathbf{x}_B &=& \mathbf{p} + Q\mathbf{x}_N \\ z &=& z_0 + \mathbf{r}^T\mathbf{x}_N. \end{array}$$

The criss-cross method only requires the set $B$ to be a basis, meaning
that the submatrix $A_B$ is regular (nonsingular). This property allows us
to form the tableau, but it guarantees neither $\mathbf{p} \ge \mathbf{0}$ (as in the simplex
method) nor $\mathbf{r} \le \mathbf{0}$ (as in the dual simplex method).

The criss-cross method has two types of pivot steps. A primal pivot step
starts by choosing $\beta \in \{1, 2, \ldots, n-m\}$ and $\alpha \in \{1, 2, \ldots, m\}$ such that
$r_\beta > 0$ and $q_{\alpha\beta} < 0$. This is as in the simplex method, except that $\alpha$
does not have to satisfy a minimum ratio condition as in equation (5.3)
of Section 5.6. The absence of this condition is explained by the fact that
the criss-cross method does not have to maintain feasibility ($\mathbf{p} \ge \mathbf{0}$).

In a dual pivot step, $\alpha \in \{1, 2, \ldots, m\}$ and $\beta \in \{1, 2, \ldots, n-m\}$ are
chosen such that $p_\alpha < 0$ and $q_{\alpha\beta} > 0$. This is as in the dual simplex
method, again without any minimum-ratio condition, since the criss-cross
method does not maintain dual feasibility ($\mathbf{r} \le \mathbf{0}$) either.

In both situations, $B' = (B \setminus \{k_\alpha\}) \cup \{\ell_\beta\}$ is the next basis, where
$q_{\alpha\beta} \ne 0$ guarantees that $B'$ is indeed a basis (see the proof of
Lemma 5.6.1).

If the linear program has an optimal solution, it can be shown that a
basis either induces an optimal basic feasible solution ($\mathbf{p} \ge \mathbf{0}$, $\mathbf{r} \le \mathbf{0}$) or
allows for a primal or a dual pivot step. This means that the criss-cross
method cannot get stuck. It can cycle, though, even if the linear program
is not degenerate.

However, there is a pivot rule for the criss-cross method that does not
cycle. This rule is reminiscent of Bland's rule for the simplex method
and goes as follows. We let $\alpha^*$ be the smallest value of $\alpha$ in any possible
primal pivot step and $\beta^*$ the smallest value of $\beta$ in any possible dual
pivot step. If $\alpha^* < \beta^*$, then we perform a primal pivot step with $\alpha = \alpha^*$
and $\beta$ as small as possible, and otherwise, we perform a dual pivot step
with $\beta = \beta^*$ and $\alpha$ as small as possible.

Despite its simplicity and the fact that the computations can start from 
any basis (there is no need for an auxiliary problem), the criss-cross 
method is not used in practice, since it is slow. The feature that makes 
it theoretically appealing is that the computations depend only on the 
signs of the involved numbers, but not on their magnitudes. This allows 
the method to be generalized to situations beyond linear programming, 
where no concept of magnitude exists, such as “linear programming” over 
oriented matroids. 

Cutting plane. Given a system S of linear inequalities, another linear in- 
equality is called a cutting plane for S if it is satisfied by all integral 
solutions of S, but it is violated by some non-integral solution of S. 
A cutting plane for $S$ exists if and only if the polyhedron corresponding
to $S$ is nonintegral. The cutting plane “cuts off” a part of this
polyhedron that is free of integer points.

Dantzig-Wolfe decomposition. Often the constraint matrix $A$ of a
linear program has a special structure that can be exploited in order to
solve the problem faster. The Dantzig-Wolfe decomposition is a particular
technique to do this within the framework of the simplex method.
Given a linear program

    maximize $\mathbf{c}^T\mathbf{x}$ subject to $A\mathbf{x} = \mathbf{b}$ and $\mathbf{x} \ge \mathbf{0}$

in equational form, we partition the $m$ equations into a system $A'\mathbf{x} = \mathbf{b}'$
of “difficult” equations and a system $A''\mathbf{x} = \mathbf{b}''$ of “easy” equations. How
this is done depends on the concrete problem, but as a general guideline,
the system $A'\mathbf{x} = \mathbf{b}'$ should contain the equations that involve many
variables (“global constraints”), while the equations of $A''\mathbf{x} = \mathbf{b}''$ are the
ones with few variables (“local constraints”). Often there are only few
global constraints and the local constraints consist of independent blocks
in the sense that constraints in different blocks do not share variables.

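Let us assume for simplicity that the polyhedron
$P = \{\mathbf{x} \in \mathbb{R}^n : A''\mathbf{x} = \mathbf{b}'', \mathbf{x} \ge \mathbf{0}\}$ is bounded. Then we know that every
$\mathbf{x} \in P$ is a convex combination of the vertices $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_K$ of $P$
(Section 4.4). If we assume that we explicitly know these vertices, our
linear program can be rewritten as a problem in $K$ variables
$t_1, t_2, \ldots, t_K$ as follows:

    maximize   $\sum_{j=1}^{K} t_j \mathbf{c}^T\mathbf{v}_j$
    subject to $\sum_{j=1}^{K} t_j A'\mathbf{v}_j = \mathbf{b}'$
               $\sum_{j=1}^{K} t_j = 1$
               $t_j \ge 0$ for all $j = 1, 2, \ldots, K$.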


This linear program typically has many fewer constraints than the one 
we started with, but many more variables. The smaller number of con- 
straints is an advantage, since then the linear equation systems that have 
to be solved during the pivot steps of the simplex method are smaller as 
well. However, the resulting savings usually do not justify the large num- 
ber of new variables. The trick that makes the approach efficient is to 
apply column generation: We can find an entering variable (and its as- 
sociated tableau column) without precomputing the vertices v;. Indeed, 
a vertex of P that yields an entering variable can be obtained as a basic 
feasible optimal solution of a suitable linear program with the “easy” set 
of constraints $A''\mathbf{x} = \mathbf{b}''$ and $\mathbf{x} \ge \mathbf{0}$.


Devex is a pivot rule that efficiently approximates the STEEPEST EDGE
pivot rule (Section 5.7), which in itself is somewhat costly to implement.
Let us first recall STEEPEST EDGE: We want to choose the entering
variable in such a way that the expression

$$\frac{\mathbf{c}^T(\mathbf{x}_{\mathrm{new}} - \mathbf{x}_{\mathrm{old}})}{\|\mathbf{x}_{\mathrm{new}} - \mathbf{x}_{\mathrm{old}}\|}$$

is maximized, where $\mathbf{x}_{\mathrm{old}}$ and $\mathbf{x}_{\mathrm{new}}$ are the current basic feasible
solution and the next one, respectively. As usual, we let $\mathbf{p}, Q, \mathbf{r}, z_0$ be the
parameters of the simplex tableau corresponding to the current feasible
basis $B$, and we write $B = \{k_1, k_2, \ldots, k_m\}$ and
$N = \{1, 2, \ldots, n\} \setminus B = \{\ell_1, \ell_2, \ldots, \ell_{n-m}\}$, where $k_1 < k_2 < \cdots < k_m$ and
$\ell_1 < \ell_2 < \cdots < \ell_{n-m}$ (see Section 5.5).

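Let us assume that the entering variable is $x_v$, and that $v = \ell_\beta$.
Moreover, let us suppose that $x_v$ has value $t \ge 0$ in $\mathbf{x}_{\mathrm{new}}$.
With $\mathbf{x} = \mathbf{x}_{\mathrm{old}}$, we then have

$$\mathbf{c}^T(\mathbf{x}_{\mathrm{new}} - \mathbf{x}_{\mathrm{old}}) = \mathbf{r}^T(\mathbf{x}_N + t\mathbf{e}_\beta - \mathbf{x}_N) = t r_\beta \ge 0,$$

since $r_\beta > 0$. We also get

$$\|\mathbf{x}_{\mathrm{new}} - \mathbf{x}_{\mathrm{old}}\| = \Big\| \mathbf{x} + t\Big(\mathbf{e}_v + \sum_{i=1}^{m} q_{i\beta}\mathbf{e}_{k_i}\Big) - \mathbf{x} \Big\| = t\Big(1 + \sum_{i=1}^{m} q_{i\beta}^2\Big)^{1/2}.$$

This implies

$$\frac{\mathbf{c}^T(\mathbf{x}_{\mathrm{new}} - \mathbf{x}_{\mathrm{old}})}{\|\mathbf{x}_{\mathrm{new}} - \mathbf{x}_{\mathrm{old}}\|} = \frac{r_\beta}{\sqrt{1 + \sum_{i=1}^{m} q_{i\beta}^2}}.$$

In order to find the entering variable that maximizes this ratio, we have
to know the entries of $Q$ in all columns corresponding to indices $\beta$ with
$r_\beta > 0$. This requires a large number of extra computations in the usual
implementation of the simplex method that does not explicitly maintain
the tableau (see Section 5.6). According to Lemma 5.5.1, the computation
of a single column of $Q$ requires $O(m^2)$ arithmetic operations, and we may
have to look at many of the $n - m$ columns.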
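
When the tableau entries are available, the steepest edge ratio can be evaluated column by column. The sketch below assumes that $Q$ and $\mathbf{r}$ are given explicitly, with $Q$ as a list of rows; the function name is hypothetical and indices are 0-based.

```python
from math import sqrt


def steepest_edge_entering(Q, r):
    """Return the column index beta maximizing r_beta / sqrt(1 + sum_i q_{i,beta}^2)
    among the columns with r_beta > 0, or None if no such column exists."""
    best, best_ratio = None, 0.0
    for beta, r_beta in enumerate(r):
        if r_beta <= 0:
            continue  # not a candidate for entering the basis
        T = 1 + sum(row[beta] ** 2 for row in Q)
        ratio = r_beta / sqrt(T)
        if best is None or ratio > best_ratio:
            best, best_ratio = beta, ratio
    return best


Q = [[-1, 0],
     [-2, -1]]
r = [3, 2]
print(steepest_edge_entering(Q, r))  # 1
```

In this instance the first column has the larger coefficient $r_\beta$, but its edge is longer ($3/\sqrt{6} < 2/\sqrt{2}$), so the second column is selected.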

The number of arithmetic operations can be brought down to $O(m)$ per
column, by maintaining and updating the values

$$T_j = 1 + \sum_{i=1}^{m} q_{ij}^2$$

in every pivot step. If $x_{k_\alpha}$ is the leaving variable, the corresponding
value in the next iteration is

$$T_j + \Big(\frac{q_{\alpha j}}{q_{\alpha\beta}}\Big)^2 T_\beta - 2\,\frac{q_{\alpha j}}{q_{\alpha\beta}} \sum_{i=1}^{m} q_{ij}q_{i\beta}, \qquad \text{(G.1)}$$

and this value can be computed with $O(m)$ arithmetic operations for a
single $j$, after a preprocessing that involves $O(m^2)$ arithmetic operations.
The Devex pivot rule differs from this procedure as follows. First, it
maintains a reference framework of only $n - m$ variables. For $T_j$, a value
$q_{ij}$ is taken into account only if the variable $x_{k_i}$ is in the reference
framework. This has the effect that steepness of edges is measured only in
the subspace spanned by the $n - m$ variables in the reference framework,
and not in the space of all $n$ variables. The major advantage of this
approximation is that it is easy to set up: We have $T_j = 1$ for all $j$ if the
reference framework is initially chosen as the current set of nonbasic
variables.
Second, Devex maintains only an approximation $\tilde{T}_j$ of $T_j$, which is
updated as follows: After each pivot step, $\tilde{T}_j$ is set to

$$\max\Big(\tilde{T}_j,\ \Big(\frac{q_{\alpha j}}{q_{\alpha\beta}}\Big)^2 \tilde{T}_\beta\Big).$$

This avoids the expensive computation of the sum in (G.1) and replaces
the sum of the first two terms by their maximum. That way, the $\tilde{T}_j$
steadily grow and at some point they become bad approximations. This
makes it necessary to reset the reference framework to the current set of
nonbasic variables from time to time, meaning that all $\tilde{T}_j$ are reset to 1.
Details and heuristic justifications for the simplified update formula are
given in

P. M. J. Harris: Pivot selection methods of the Devex LP code,
Math. Prog. 5 (1973) 1-28.


Dual simplex method. This is a method for solving linear programs in
equational form. Like the simplex method, it goes through a sequence of
simplex tableaus

$$\begin{array}{rcl} \mathbf{x}_B &=& \mathbf{p} + Q\mathbf{x}_N \\ z &=& z_0 + \mathbf{r}^T\mathbf{x}_N \end{array}$$

until a tableau with $\mathbf{p} \ge \mathbf{0}$, $\mathbf{r} \le \mathbf{0}$ is encountered. Such a tableau proves
that the vector $\mathbf{x}^*$ given by $\mathbf{x}_B^* = \mathbf{p}$, $\mathbf{x}_N^* = \mathbf{0}$ is optimal. While the simplex
method maintains the invariant $\mathbf{p} \ge \mathbf{0}$ ($B$ is a feasible basis), the dual
simplex method maintains the invariant $\mathbf{r} \le \mathbf{0}$ ($B$ is a dual feasible basis).
As long as there are indices $i$ such that $p_i < 0$, the dual simplex method
chooses a leaving variable $x_u = x_{k_\alpha}$ with $p_\alpha < 0$. Then it searches for
an entering variable $x_v = x_{\ell_\beta}$ whose increment results in $x_u = 0$. This is
possible only if $q_{\alpha\beta} > 0$. Moreover, in the tableau corresponding to the
next basis $B' = (B \setminus \{u\}) \cup \{v\}$, all coefficients of nonbasic variables in
the last row should still be nonpositive. Rewriting the tableau as usual,
we find that the next basis $B'$ is dual feasible if and only if $\beta$ satisfies

$$q_{\alpha\beta} > 0 \quad\text{and}\quad \frac{r_\beta}{q_{\alpha\beta}} = \max\Big\{ \frac{r_j}{q_{\alpha j}} : q_{\alpha j} > 0,\ j = 1, 2, \ldots, n-m \Big\}.$$

If the dual simplex method does not cycle, it will eventually reach $\mathbf{p} \ge \mathbf{0}$
(an optimal solution is found), or it encounters a situation in which all
$q_{\alpha j}$ are nonpositive. But then the linear program under consideration is
infeasible.
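
The selection of the leaving and entering variables in one dual simplex step can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, indices are 0-based, the tableau data $\mathbf{p}$, $Q$, $\mathbf{r}$ are given explicitly with $\mathbf{r} \le \mathbf{0}$, and the first row with $p_\alpha < 0$ is taken as the leaving row (the text leaves this choice open).

```python
from fractions import Fraction


def dual_simplex_pivot(p, Q, r):
    """Choose (alpha, beta) for one dual simplex pivot step.
    Returns "optimal" if p >= 0, and "infeasible" if row alpha has no positive entry."""
    alpha = next((i for i, p_i in enumerate(p) if p_i < 0), None)
    if alpha is None:
        return "optimal"
    candidates = [j for j, q in enumerate(Q[alpha]) if q > 0]
    if not candidates:
        return "infeasible"
    # ratio test: maximize r_j / q_{alpha j} over the columns with q_{alpha j} > 0
    beta = max(candidates, key=lambda j: Fraction(r[j]) / Fraction(Q[alpha][j]))
    return alpha, beta


p = [2, -3]
Q = [[1, -1],
     [2, 1]]
r = [-4, -1]
print(dual_simplex_pivot(p, Q, r))  # (1, 1)
```

Here the ratios are $-4/2 = -2$ and $-1/1 = -1$, so the second column enters. Choosing the first column instead would give the second column the new coefficient $-1 - (-4) \cdot 1/2 = 1 > 0$ in the last row, destroying dual feasibility.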

The reader might have noticed that the computations involved in a pivot
step of the dual simplex method look very similar to those in the (primal)
simplex method explained in Section 5.6. Indeed, they are the computations
of the primal simplex method applied to the “dual” tableau

$$\begin{array}{rcl} \mathbf{y}_N &=& -\mathbf{r} - Q^T\mathbf{y}_B \\ z &=& -z_0 - \mathbf{p}^T\mathbf{y}_B. \end{array}$$

One can in fact show that the dual simplex method is just the primal
simplex method in disguise, applied to the dual linear program, and so the
dual simplex method is not really a new tool for solving linear programs.
However, it is useful in practice, since it works with the original linear 
program (as opposed to the dual linear program), and most simplex- 
based linear programming solvers offer an option of using it. For certain 
classes of problems it can be substantially faster than the primal simplex 
method. 

The dual simplex method is also useful after adding cutting planes, since 
the search for the new solution can start from the old one (which remains 
dual feasible). 

Gomory cut. We consider a linear program 


maximize ce’ x subject to Ax = b, x > 0 


in equational form with c € Z”, along with an optimal basic feasible 
solution x*. A Gomory cut is a specific cutting plane for this linear pro- 
gram, derived from a feasible basis B associated with x*. Given B, we 


can rewrite the equations Ax = b and z = c’ x in tableau form 
xp =p + Qxn 
z =2 + xy 


(see Section 5.5). It is easy to show that if for some 7 € {1,2,...,m}, the 
value p; = x¥ is nonintegral, then the inequality 


ri < [pi] + lailxn 


is a cutting plane, where q; is the ith row of Q. This cutting plane is 
called a Gomory cut. A special Gomory cut is obtained if zo ¢ Z: 


z=c'x < |zo| + [r7]xn. 


We may now add the Gomory cuts as new inequalities to the linear 
program and recompute the optimal solution. Let us call this a round. The 
remarkable property of Gomory cuts is that we get an integral optimal 
solution after a finite number of rounds (assuming rational data). Since 
we never cut off integral solutions of the original linear program, this 
final solution is an optimal solution of the integer program 


$$\text{maximize } c^T x \text{ subject to } Ax = b,\ x \ge 0,\ x \in \mathbb{Z}^n.$$


The method of Gomory cuts is a simple but inefficient algorithm for 
solving integer programs. 
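
The rounding step behind a single Gomory cut can be sketched in a few lines of Python. This is a minimal illustration, assuming the tableau row is given as exact fractions in the convention $x_B = p + Q x_N$; the function name `gomory_cut` and the sample row are hypothetical and do not come from the text.

```python
from fractions import Fraction
from math import ceil, floor


def gomory_cut(p_i, q_i):
    """Round one tableau row x_k = p_i + q_i . x_N into a Gomory cut.

    Returns (rhs, coeffs), meaning x_k <= rhs + coeffs . x_N.
    """
    p_i = Fraction(p_i)
    if p_i.denominator == 1:
        raise ValueError("p_i is integral, so this row yields no cut")
    rhs = floor(p_i)                              # constant term rounded down
    coeffs = [ceil(Fraction(q)) for q in q_i]     # coefficients rounded up
    return rhs, coeffs


# Hypothetical row: x_k = 7/2 - (1/2) x_a + (3/4) x_b
rhs, coeffs = gomory_cut(Fraction(7, 2), [Fraction(-1, 2), Fraction(3, 4)])
print(rhs, coeffs)
```

For the sample row this gives the cut $x_k \le 3 + x_b$. The current solution has $x_a = x_b = 0$ and $x_k = 7/2$, so it violates the cut, while every integral feasible solution satisfies it.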

Phases I and II. Traditionally, the computation of the simplex method is 
subdivided into phase I and phase II. In phase I, the auxiliary linear 
program for finding an initial feasible basis is solved, and phase II solves 
the original linear program, starting from this initial feasible basis (Sec- 
tion 5.6). 

Phase transition. Even if phase I of the simplex method reveals that the 
original problem is feasible, it may happen that some of the auxiliary 
variables are still basic (with value 0) in the final feasible basis of the 
auxiliary problem. We have indicated in Section 5.6 that it is nevertheless 
easy to get a feasible basis for the original problem. 

In the simplex method this part can elegantly be implemented by pivot 
steps of a special kind, in which the possible leftovers of the auxiliary 
variables $x_{n+1}$ through $x_{n+m}$ are forced to leave the basis. These pivot 
steps are said to form the phase transition. 

Under our assumption that the matrix A has rank m, the phase transition 
is guaranteed to succeed. But even the case in which A has rank smaller 
than m can be handled during these special pivot steps, and this is an 
important aspect in practice, since it allows the simplex method to work 
with any input. If A does not have full rank, the consequence is that 
some of the auxiliary variables $x_{n+1}$ through $x_{n+m}$ cannot be forced to 
leave the basis. But any such variable can be shown to correspond to a 
redundant constraint, one that may be deleted from the linear program 
without changing the set of feasible solutions. Moreover, after deleting all 
the redundant constraints discovered in this way, the resulting subsystem 
A’x = b’ has a matrix A’ of full rank. A basis of the linear program with 
respect to this new system is obtained from the current basis by removing 
the auxiliary variables that could not be forced to leave. 

Pivot column. The column of the simplex tableau corresponding to the 
entering variable is called a pivot column. Depending on the context, 
this may or may not include the coefficient of the vector r corresponding 
to the entering variable. 

Pivot element. This is the element of the simplex tableau in the pivot row 
and pivot column. 

Pivot row. The row of the simplex tableau corresponding to the leaving 
variable is called the pivot row. Depending on the context, this may or 
may not include the coefficient of the vector p that holds the value of the 
leaving variable. 

Pricing. The process of selecting the entering variable during a pivot step 
of the simplex method is sometimes referred to as pricing. We say that 
a nonbasic variable $x_{\ell_j}$ is priced when its coefficient $r_j$ in the last row of 
the simplex tableau is computed (Section 5.6). 

Primal-dual method. This is a method for solving a linear program by
iteratively improving a feasible solution of the dual linear program. Let
us start with a linear program in equational form:

$$\text{maximize } c^T x \text{ subject to } Ax = b \text{ and } x \ge 0. \tag{P}$$

We assume that $b \ge 0$ and that (P) is bounded. The dual linear program
is

$$\text{minimize } b^T y \text{ subject to } A^T y \ge c. \tag{D}$$

Let us assume that we have a feasible solution $y$ of (D). Then we define

$$J = \{\, j \in \{1,2,\ldots,n\} : a_j^T y = c_j \,\},$$

where $a_j$ denotes the $j$th column of $A$. It turns out that $y$ is an optimal
solution of (D) if and only if there exists a feasible solution $x$ of (P) such
that

$$x_j = 0 \quad \text{for all } j \in \{1,2,\ldots,n\} \setminus J \tag{G.2}$$


(this can easily be checked directly or seen from the complementary slack- 
ness theorem). To check for the existence of such a vector x, we solve an 
auxiliary linear program, the restricted primal: 


$$
\begin{array}{ll}
\text{minimize} & z_1 + z_2 + \cdots + z_m \\
\text{subject to} & A_J x_J + I_m z = b \\
 & x_J, z \ge 0.
\end{array}
\tag{RP}
$$


By our assumption $b \ge 0$, (RP) is feasible, and it is also bounded. Let
$x^*, z^*$ be an optimal solution. If $z^* = 0$, then $x^*$ optimally solves (P),
and $y$ optimally solves (D). Otherwise, we know that $y$ cannot be an
optimal solution of (D), and we want to find a better dual solution. To
this end, we consider the dual of (RP), which can be written as

$$
\begin{array}{ll}
\text{maximize} & b^T y \\
\text{subject to} & (A_J)^T y \le 0 \\
 & y \le 1.
\end{array}
\tag{RD}
$$

Let $y^*$ be an optimal solution of (RD). From the duality theorem we
know that $b^T y^* = \sum_{i=1}^{m} z_i^* > 0$. Consequently, for every $t > 0$, the
vector $y - t y^*$ is an improved dual solution, provided that it is feasible
for (D). We claim that there exists a small $t > 0$ such that $y - t y^*$ actually
is feasible for (D). Indeed, for $j \in J$, we have

$$a_j^T (y - t y^*) \ge c_j + t \cdot 0,$$

and for $j \in \{1,2,\ldots,n\} \setminus J$, we get

$$a_j^T (y - t y^*) = a_j^T y - t\, a_j^T y^* \ge c_j$$

for a suitable $t$, because $a_j^T y > c_j$ for these $j$. Now we choose $t^*$ as large
as possible such that the last inequality still holds for all
$j \in \{1,2,\ldots,n\} \setminus J$, and we replace $y$ by
$y - t^* y^*$ for the next iteration of the primal-dual method. The set $J$ will
change as well, since at least one inequality that previously had slack has
now become tight. Note that $t^*$ exists, since otherwise, (D) is unbounded
and (P) is infeasible, in contradiction to our assumption $b \ge 0$.
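
The update of the dual solution in one iteration can be sketched as follows. This is only an illustration under stated assumptions: the optimal solution $y^*$ of (RD) is taken as given (the sketch does not solve (RD)), and the helper names `tight_set` and `max_step` as well as the small instance are hypothetical.

```python
from fractions import Fraction


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def columns(A):
    """Columns a_j of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def tight_set(A, c, y):
    """The set J of (0-based) indices j with a_j^T y = c_j."""
    return [j for j, a in enumerate(columns(A)) if dot(a, y) == c[j]]


def max_step(A, c, y, y_star):
    """Largest t with A^T (y - t y*) >= c, or None if t is unbounded."""
    bounds = []
    for j, a in enumerate(columns(A)):
        slack = dot(a, y) - c[j]      # nonnegative, since y is feasible for (D)
        decrease = dot(a, y_star)     # <= 0 for j in J when y* is feasible for (RD)
        if decrease > 0:              # only these constraints limit the step
            bounds.append(Fraction(slack) / Fraction(decrease))
    return min(bounds) if bounds else None


# Hypothetical instance with m = 2, n = 3.
A = [[1, 0, 1],
     [0, 1, 1]]
c = [1, 1, 3]
y = [3, 3]            # feasible for (D); here J is empty
y_star = [1, 1]       # optimal for (RD) when J is empty: maximize y_1 + y_2 with y <= 1

t = max_step(A, c, y, y_star)
y_new = [yi - t * si for yi, si in zip(y, y_star)]
print(t, y_new, tight_set(A, c, y_new))
```

In this instance the step is $t^* = 3/2$, the new dual solution is $(3/2, 3/2)$, and the third constraint becomes tight, so $J$ gains one index as described above.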

With some care, this method yields an optimal solution of (D), and hence 
an optimal solution x* to (RP) and (P), after a finite number of iterations. 
This is not a priori clear: Even if the dual objective function strictly 
decreases in every iteration, the improvement could become smaller and 
smaller; in fact, there are examples in which this happens, and in which 
sets J reappear infinitely often. But this can be overcome by choosing 
for example the lexicographically largest optimal solution y* of (RD). 

There are two aspects that make the primal-dual method attractive.
First of all, the restricted primal (RP) and its dual (RD) do not involve
the original objective function vector $c$. This means that the primal-dual
method reduces an optimization problem to a sequence of decision
problems. These decision problems are in many cases much easier to 
solve and do not require linear programming techniques. For instance, 
if (P) is the linear program that computes a maximum-weight matching 
in a bipartite graph (Section 3.2), then (RP) computes a matching of 
maximum cardinality in the subgraph with edges indexed by J. Moreover, 
if (RP) has been solved once, its solution in the next iteration is easy 
to get through an augmenting path. These insights form the basis of 
the Hungarian method for maximum-weight matchings, which can be 
interpreted as a sophisticated implementation of the primal-dual method.

A second aspect appears in connection with approximation algorithms
for NP-hard problems. Starting from the LP relaxation (P) of a suitable 
integer program, we may search for a vector x that satisfies a condition 
weaker than (G.2), but in return it allows us to construct a related inte- 
gral feasible solution x* to (P). If we cannot find x, we know as before 
that the dual solution y can be improved. If we find x, then y may not 
yet be optimal, but we may still be able to argue that x* is a reasonably 
good solution of the integer program. A number of approximation algorithms
based on the primal-dual method are described in

M. X. Goemans and D. P. Williamson: The primal-dual method
for approximation algorithms and its application in network design
problems, in Approximation Algorithms (D. Hochbaum, editor),
PWS Publishing Company, Boston, 1997, pages 144-191.


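Ratio test. The process of selecting the leaving variable during a pivot step
of the simplex method is called the ratio test. The leaving variable $x_{k_\alpha}$
is such that it has minimum ratio $-p_\alpha / q_{\alpha\beta}$ among the rows with
$q_{\alpha\beta} < 0$, where $\beta$ indexes the pivot column (Section 5.6).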
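
The selection rule can be sketched in Python as follows. This is a minimal illustration, assuming the tableau convention $x_B = p + Q x_N$ with $p \ge 0$ and a column index `beta` for the entering variable; the function name `ratio_test`, the tie-breaking by smallest row index, and the sample data are assumptions, not part of the text.

```python
from fractions import Fraction


def ratio_test(p, Q, beta):
    """Pick the pivot row for entering column beta in x_B = p + Q x_N.

    Returns (alpha, ratio), or None if no row limits the increase.
    """
    best = None
    for i, (p_i, row) in enumerate(zip(p, Q)):
        q = row[beta]
        if q < 0:                                  # row i decreases as the entering variable grows
            ratio = Fraction(p_i) / Fraction(-q)   # equals -p_i / q_{i,beta}
            if best is None or ratio < best[1]:
                best = (i, ratio)                  # ties keep the smallest row index
    return best


# Hypothetical tableau data with three basic and two nonbasic variables.
p = [4, 6, 3]
Q = [[-2, 1],
     [-1, 0],
     [1, -3]]
print(ratio_test(p, Q, 0))
```

A return value of `None` means that no basic variable decreases when the entering variable grows, so the entering variable can be increased without bound.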


Reduced costs. The vector $r$ in a simplex tableau; $r_j$ is the reduced cost
of variable $x_{\ell_j}$ (Section 5.5).


Sensitivity analysis. The components of the matrix $A$ and the vectors $b$
and $c$ that define a linear program are often results of measurements
or estimates. Then an important question is how “stable” the optimal
solution $x^*$ is. Ideally, if the components vary by small amounts, then $x^*$
(or at least $c^T x^*$) varies by a small amount as well. In this case small
errors in collecting data can safely be ignored.

It may well be that small changes in some of the components have a 
more drastic impact on x*. Sensitivity analysis tries to assess how the 
solution depends on the input, both by theoretical considerations and by
computer simulations.

The simplex method and the dual simplex method are excellent tools for
the latter task, given that only $b$ and $c$ vary. If $c$ varies, we can start
from the old optimal solution $x^*$ and reoptimize with respect to the new
objective function vector $c$. If $c$ changed only a little, chances are good
that this reoptimization requires only a few pivot steps.

Similarly, if $b$ changes, the dual simplex method may be used to reoptimize,
since the old optimal solution $x^*$ is still associated with a dual
feasible basis under the new right-hand-side vector $b$.

The primal simplex method can also efficiently deal with the addition of
new variables, while the dual simplex method can handle the addition of
constraints. These two operations in particular occur in connection with
column generation and cutting planes.
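
Why the two methods divide the work in this way can be read off from the explicit form of the tableau for a basis $B$. The following is a short derivation sketch, assuming the tableau convention $x_B = p + Q x_N$, $z = z_0 + r^T x_N$ of Section 5.5 and a maximization problem in equational form. Solving $A_B x_B + A_N x_N = b$ for $x_B$ and substituting into $z = c^T x$ gives

$$
\begin{aligned}
p &= A_B^{-1} b, & Q &= -A_B^{-1} A_N, \\
z_0 &= c_B^T A_B^{-1} b, & r &= c_N - (c_B^T A_B^{-1} A_N)^T.
\end{aligned}
$$

The vector $b$ enters only $p$ and $z_0$. Changing $b$ therefore leaves $r \le 0$ intact, so the basis stays dual feasible, but it may destroy $p \ge 0$; this is the starting situation of the dual simplex method. The vector $c$ enters only $z_0$ and $r$. Changing $c$ leaves $p \ge 0$ intact, so the basis stays feasible, but it may destroy $r \le 0$; this is the starting situation of the primal simplex method.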

Total dual integrality. A system $Ax \le b$ of linear inequalities is said to
be totally dual integral (TDI) if the linear program

$$\text{minimize } b^T y \text{ subject to } A^T y = c,\ y \ge 0$$

has an integral optimal solution $y$ whenever $c$ is an integral vector for
which an optimal solution exists. It can then be shown that the (primal)
problem

$$\text{maximize } c^T x \text{ subject to } Ax \le b$$

has an integral optimal solution for all $c$, provided that $A$ is rational and
$b$ is integral.

Under total dual integrality of $Ax \le b$, the set $P = \{x : Ax \le b\}$ is
therefore an integral polyhedron (provided $A$ is rational and $b$ is integral).
We first met integral polyhedra in Section 8.2, where we used the
concept of total unimodularity to establish integrality. The TDI notion
is connected to total unimodularity as follows. A matrix $A$ is totally
unimodular if and only if the system $\{Ax \le b, x \ge 0\}$ is TDI for all integral
vectors $b$.

One of the most prominent examples of a TDI system is Edmonds’ description
of the matching polytope by inequalities. For a (not necessarily
bipartite) graph $G = (V, E)$, the matching polytope is the convex
hull of all incidence vectors of matchings. As in Section 3.2, the incidence
vector of a matching $M \subseteq E$ is the $|E|$-dimensional vector $x$ with

$$
x_e = \begin{cases} 1 & \text{if } e \in M, \\ 0 & \text{otherwise.} \end{cases}
$$

Any such vector satisfies the inequalities

$$
\begin{aligned}
x_e &\ge 0 && \text{for all } e \in E, \\
\sum_{e \ni v} x_e &\le 1 && \text{for all } v \in V.
\end{aligned}
\tag{G.3}
$$

If $G$ is a bipartite graph, we have shown that the polytope $P$ defined 
by these inequalities is integral (Section 8.2, Lemmas 8.2.5 and 8.2.4). It 
follows that every vertex of P is the incidence vector of some matching. 
Consequently, P coincides with the matching polytope in the bipartite 
case, and the system (G.3) is an explicit description of the matching 
polytope by inequalities. 

This is no longer true for nonbipartite graphs. Indeed, if we consider the
triangle, then the polytope $P$ defined by (G.3) has the five vertices $0$, $e_1$,
$e_2$, $e_3$, and $\frac{1}{2}(e_1 + e_2 + e_3)$ and is therefore not integral.

[Figure: the polytope $P$ for the triangle.]


Yet there is a larger inequality system that leads to an integral polytope. 
The additional inequalities stem from the observation that every subset 
$A \subseteq V$ of odd size $2k+1$ supports at most $k$ edges of any fixed matching.
Therefore, every incidence vector of a matching satisfies all inequalities
in the following system (the odd subset inequalities):

$$
\sum_{e \in E,\ e \subseteq A} x_e \le \frac{|A| - 1}{2} \quad \text{for all } A \subseteq V \text{ with } |A| \text{ odd.}
\tag{G.4}
$$


It can be shown that the system of inequalities in (G.3) and (G.4) is TDI, 
and consequently, these inequalities define the matching polytope for a 
general graph G. 

In the case of a triangle, the only nontrivial inequality in (G.4) is obtained
for $A = V$:

$$\sum_{e \in E} x_e \le 1.$$

This inequality cuts off the fractional vertex $\frac{1}{2}(e_1 + e_2 + e_3)$ and leaves
the tetrahedron that is the convex hull of $0$, $e_1$, $e_2$, and $e_3$. This integral
polytope is the matching polytope of the triangle.
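
The triangle example is small enough to check by direct computation. The sketch below is only an illustration: the vertex and edge labels and the function names `satisfies_G3` and `violated_odd_sets` are hypothetical. It tests a vector against the degree and nonnegativity inequalities (G.3) and then lists the odd subsets whose inequality in (G.4) is violated.

```python
from fractions import Fraction
from itertools import combinations

V = [0, 1, 2]
E = [(0, 1), (1, 2), (0, 2)]     # the triangle


def satisfies_G3(x):
    """Nonnegativity, and total weight at most 1 on the edges at each vertex."""
    nonneg = all(x[e] >= 0 for e in E)
    degree = all(sum(x[e] for e in E if v in e) <= 1 for v in V)
    return nonneg and degree


def violated_odd_sets(x):
    """Odd subsets A of V with more than (|A| - 1)/2 total weight inside A."""
    bad = []
    for size in range(3, len(V) + 1, 2):       # sets of size 1 contain no edge
        for A in combinations(V, size):
            inside = sum(x[e] for e in E if set(e) <= set(A))
            if inside > Fraction(len(A) - 1, 2):
                bad.append(A)
    return bad


half = {e: Fraction(1, 2) for e in E}                 # the fractional vertex
single = {(0, 1): 1, (1, 2): 0, (0, 2): 0}            # incidence vector of a matching

print(satisfies_G3(half), violated_odd_sets(half))
print(satisfies_G3(single), violated_odd_sets(single))
```

The fractional vector passes (G.3) but violates the odd subset inequality for $A = V$, while the incidence vector of a single edge satisfies both systems.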

We now have an exponentially large but explicit inequality description 
of the matching polytope of a general graph. Having such a description 
for an integral polyhedron is an indication that it might be possible to 
optimize a linear function over the polyhedron in polynomial time. In 
the case of matchings in general graphs, this can indeed be done. A 
maximum-weight matching in a general graph can be found in polynomial 
time, but the known algorithms are much more involved than those for 
the bipartite case discussed in Section 3.2. 

0 (a vector of all 0’s), 195 

1 (a vector of all 1’s), 195 

$A^T$ (transposed matrix), 197

$A_B$ (columns of $A$ indexed by $B$), 44

$Ax \le b$ (a system of linear

$I_n$ (identity matrix), 197

$\nabla f$ (gradient), 186

158(8.4.1) 

affine map, 108 

affine scaling method, 116 

affine subspace, 195 
— Bose—Mesner, 164 

— matrix, 164 

algorithm 




— approximation, 151 

— greedy, 32, 38 

— in the Twentieth century, 8 
— Karmakar’s, 115 

— polynomial, 107 

angle (of vectors), 199 
approximation algorithm, 151 
auxiliary linear program, 209 


ball, smallest enclosing, 184 

basic feasible solution, 44, 
54(4.4.2) 

basic variable, 45 

basis, 196 

— feasible, 46 

basis pursuit, 169, 175 

battle of the sexes, 141 

best response, 134 

bimatrix game, 140 

binomial coefficient, 46 

Birkhoff polytope, 52 

bit size, 106 

Bland’s rule, 72 

— finiteness, 73(5.8.1) 

Blotto, Colonel (game), 131 

Borel, Emile, 139 

Bose—Mesner algebra, 164 

bound 

— Delsarte, 159(8.4.3) 

— sphere-packing, 159(8.4.2) 

bounds for variables, 201 

BP-exactness, 170 

branch and bound, 31, 201 

branch and cut, 31, 203 
central path method, 116 

Chvátal rank, 203 

Chvátal-Gomory cut, 203 

code, error-correcting, 156, 167 

Colonel Blotto (game), 131 

column generation, 203 

column space, 197 

combination 

— convex, 49 

— linear, 195 

combinatorics, polyhedral, 31 

complement, orthogonal, 200 

complementarity, strict, 128 

complementary slackness, 204 

complexity, smoothed, 78 

conditions, Karush—Kuhn—Tucker, 
187(8.7.2) 

conjecture, Hirsch, 78 

constraint, 3 

— nonnegativity, 41 

convex combination, 49 

convex function, 49 

convex hull, 49 

convex polyhedron, 8, 51 

convex polytope, 51 

convex program, 185 

— quadratic, 184 

convex programming, 185 

convex set, 48 

cover, vertex, 37 

Cramer’s rule, 199 

criss—cross method, 204 

crosspolytope, 52 

cube, 51 

— Klee—Minty, 76, 130 

cut, maximum, 114 

cutting planes, 31, 35, 205 

cycling, 63, 72 


d-interval, 178 

$d_H(\cdot,\cdot)$ (Hamming distance), 157 
Dantzig, George, 9, 13 
degenerate linear program, 62 
Delsarte bound, 159(8.4.3) 


determinant, 199 

Devex pivot rule, 206 

diagonal matrix, 197 

dietary problem, 12 

dilemma, prisoner’s, 142 

dimension, 196 

dimension, of a convex 
polyhedron, 51 

distance 

— Euclidean, 199 

— Hamming, 157 

— of a code, 158(8.4.1) 

— point from a line, 24 

distribution, normal, 170(8.5.2) 

dual linear program, 82 

dual simplex method, 105, 208 

duality theorem, 9, 83 

— and physics, 86 

— proof, 94, 127 

— weak, 83(6.1.1) 


$e_i$ ($i$th vector of the standard 
basis), 196 

Easter Bunny, 138 

edge (of a polyhedron), 54 

elimination 

— Fourier—Motzkin, 100 

— Gaussian, 198 

— — polynomiality, 107 

ellipsoid, 108 

ellipsoid method, 106 

embedding, self-dual, 125 

entering variable, 67 

epigraph (of a function), 49 

equational form, 41 

equilibrium, Nash 

— mixed, 135(8.1.1) 

— pure, 135 

error-correcting code, 156, 167 

Euclidean distance, 199 

Euclidean norm, 199 

expansion of a determinant, 199 

extremal point, 55 


face (of a polyhedron), 53 


Farkas lemma, 89-104 
feasible basis, 46 

feasible solution, 3 

— basic, 44, 54(4.4.2) 
fitting a line, 19 

flow, 11, 14 

form 

— equational, 41 

— standard, 41 
Fourier—Motzkin elimination, 100 
fractional matching, 182 
fractional transversal, 182 
function 

— convex, 49 

— objective, 3 

— strictly convex, 49 


Gallai’s theorem, 178 
game 

— battle of the sexes, 141 
— bimatrix, 140 

— Colonel Blotto, 131 

— rock-paper-scissors, 133 
— — modified, 138 

— value, 136 

— zero-sum, 131 

Gaussian elimination, 198 
— polynomiality, 107 
Goldman—Tucker system, 126 
Gomory cut, 209 

greedy algorithm, 32, 38 


half-space, 50 

Hall’s theorem, 144(8.2.1) 
Hamming distance, 157 
Hamming, Richard, 156 
Helly’s theorem, 177 
Hirsch conjecture, 78 
hull, convex, 49 
hyperplane, 50 


$I_n$ (identity matrix), 197 

identity matrix, 197 

incidence matrix (of a graph), 146 
independence, linear, 195 
independent set, 39 

infeasible linear program, 4, 63 
integer hull, 203 

integer programming, 29 

— mixed, 31 

integral polyhedron, 148 
integrality, total dual, 213 
interior point method, 115 
d-interval, 178 

inverse matrix, 198 


JPEG (encoding), 176 


König’s theorem, 144(8.2.2) 
Kantorovich, Leonid, 9 
Karmakar’s algorithm, 115 
Karush—Kuhn—Tucker conditions, 
187(8.7.2) 
Khachyian, Leonid, 106 
Klee—Minty cube, 76, 130 
Koopmans, Tjalling, 9 


$\ell_1$-norm, 169 

Lagrange multipliers, 119, 188 

Laplace expansion, 199 

largest coefficient pivot rule, 71 

largest increase pivot rule, 71 

leaving variable, 67 

Lemke—Howson algorithm, 140 

lemma, Farkas, 89-104 

lexicographic ordering, 73 

lexicographic rule, 72 

line fitting, 19 

linear combination, 195 

linear inequality, system, 7, 101, 
109 

linear program, 1 

— degenerate, 62 

— dual, 82 

— infeasible, 4, 63 

— unbounded, 4, 61 

linear programming, meaning, 1 

linear span, 196 

linear subspace, 195 

linearly independent, 195 

LP relaxation, 33 
makespan, 149 

map, affine, 108 

marriage theorem, 144(8.2.1) 

matching 

— fractional, 182 

— maximum-weight, 31, 147 

— perfect, 33 

matching number, 181 

matching polytope, 213 

matrix, 196 

— diagonal, 197 

— identity, 197 

— inverse, 198 

— multiplication, 196 

— nonsingular, 197 

— orthogonal, 199 

— payoff, 132 

— skew-symmetric, 127 

— totally unimodular, 144 

— transposed, 197 

matrix algebra, 164 

matroid, oriented, 205 

max-flow-min-cut theorem, 148 

maximum cut, 114 

maximum-weight matching, 31, 
147 

method 

— affine scaling, 116 

— central path, 116 

— dual simplex, 208 

— ellipsoid, 106 

— interior point, 115 

— least squares, 20 

— potential reduction, 116 

— primal—dual, 105, 210 

— simplex, 8, 57 

— — dual, 105 

— — efficiency, 76 

— — history, 8, 13 

— — revised, 70 

minimax theorem, 148 


— for zero-sum games, 136(8.1.3) 


minimum bounding sphere, 184 
minimum vertex cover, 37 


mixed integer programming, 31 

mixed strategy, 134 

multiplier 

— Karush—Kuhn—Tucker, 
187(8.7.2) 

— Lagrange, 119, 188 


Nash equilibrium 

— mixed, 135(8.1.1) 

— pure, 135 

network flow, 11, 14 
Neumann, John von, 140 
nonbasic variable, 45 
nonsingular matrix, 197 
norm 

— Euclidean, 199 

— $\ell_1$, 169 

normal distribution, 170(8.5.2) 
NP-hard problem, 9 
number 

— matching, 181 

— transversal, 181 


objective function, 3 
octant, 42 

operation, row, 198 
optimal solution, 3 

— existence, 46 

optimum, 3 

oracle, separation, 113 
ordering, lexicographic, 73 
oriented matroid, 205 
orthant, 42 

orthogonal complement, 200 
orthogonal matrix, 199 
orthogonal projection, 200 
orthogonal vectors, 199 


path, central, 118 
payoff matrix, 132, 134 
perfect matching, 33 
phase transition, 210 
phases I and II, 209 
physics and duality, 86 
pivot column, 210 
pivot element, 210 


pivot row, 210 

pivot rule, 71 

— Bland’s, 72 

— Devex, 206 

— largest coefficient, 71 
— largest increase, 71 

— randomized, 72, 77 

— steepest edge, 71 

— — implementation, 206 
pivot step, 59, 67 

planes, cutting, 35 

point, extremal, 55 
polyhedral combinatorics, 31 
polyhedron 

— convex, 51 

— integer hull, 203 

— integral, 148 

— rational, 203 
polynomial algorithm, 107 


polytope 
— Birkhoff, 52 
— convex, 51 


— matching, 213 


potential reduction method, 116 


pricing, 210 


primal—dual method, 105, 210 


prisoner’s dilemma, 142 
problem 

— dietary, 12 

— NP-hard, 9 

product, scalar, 199 
program 

— convex, 185 

— — quadratic, 184 

— integer, 30 

— linear, 1 
programming 

— convex, 185 

— integer, 29 

— linear, meaning, 1 
— semidefinite, 114 
projection, orthogonal, 200 
pure strategy, 134 


quadrant, 42 
randomized pivot rule, 72, 77 
rank, 197 

ratio test, 212 

rational polyhedron, 203 
reduced cost, 212 

relaxation, LP, 33 

revised simplex method, 70 
rock-paper-scissors game, 133 
— modified, 138 


row operation, elementary, 198 


rule 

— Bland’s, 72 

— — finiteness, 73(5.8.1) 
— Cramer’s, 199 

— Devex, 206 

— largest coefficient, 71 
— largest increase, 71 

— pivot, 71 

— randomized, 72, 77 

— steepest edge, 71 

— — implementation, 206 


Santa Claus, 138 

scalar product, 199 
scheduling, 148 

self-dual embedding, 125 


semidefinite programming, 114 


sensitivity analysis, 212 
separation of points, 21 
separation oracle, 113 
set 

— convex, 48 

— independent, 39 
signal processing, 176 
simplex, 52 

simplex method, 8, 57 
— and a simplex, 52 

— dual, 105, 208 

— efficiency, 76 

— history, 8, 13 

— lexicographic rule, 72 
— origin of the term, 52 
— phase transition, 210 
— phases I and II, 209 
— pricing, 210 

— ratio test, 212 

— reduced costs, 212 

— revised, 70 

simplex tableau, 58, 65 

— pivot column, 210 

— pivot element, 210 

— pivot row, 210 

— reduced cost vector, 212 

size, bit, 106 

skew-symmetric matrix, 127 

slack variable, 42 

slackness, complementary, 204 

smallest enclosing ball, 184 

smoothed complexity, 78 

solution 

— feasible, 3 

— — basic, 44, 54(4.4.2) 

— optimal, 3 

— — existence, 46 

— sparse, 168 

space 

— column, 197 

span, linear, 196 

sparse solution, 168 

sphere-packing bound, 159(8.4.2) 

square matrix, 197 

standard basis, 196 

standard form, 41 

steepest edge pivot rule, 71 

— implementation, 206 

step, pivot, 59, 67 

strategy, mixed, 134 

strategy, pure, 134 

strict complementarity, 128 

strictly convex function, 49 

strongly polynomial algorithm, 
108 

submatrix, 196 

subspace 

— affine, 195 

— linear, 195 

symmetric zero-sum game, 140 

system 

— Goldman-Tucker, 126 

— linear, underdetermined, 167 


— of linear inequalities, 7, 101, 
109 


tableau, simplex, 58, 65 

theorem 

— complementary slackness, 204 

— duality, 9, 83 

— — proof, 94, 127 

— — weak, 83(6.1.1) 

— Gallai’s, 178 

— Hall’s, 144(8.2.1) 

— Helly’s, 177 

— König’s, 144(8.2.2) 

— marriage, 144(8.2.1) 

— minimax, 148 

— — for zero-sum games, 
136(8.1.3) 

TIT FOR TAT, 142 

total dual integrality, 213 

totally unimodular matrix, 144 

transposed matrix, 197 

transversal, 178 

— fractional, 182 

transversal number, 181 


unbounded linear program, 4, 61 
underdetermined linear system, 
value, of a game, 136 
variable 
— leaving, 67 

— slack, 42 

vector, 195 

vertex (of a polyhedron), 53 
vertex cover, 37 

— minimum, 37 


weak duality theorem, 83(6.1.1) 
Z (integers), 30 
— symmetric, 140 